Bug #8373

Updated by Peter Amstutz about 8 years ago

When a task fails with a Keep error, it is supposed to tempfail and be restarted. However, the code in crunch-job does this:

 <pre> 
     elsif ($line =~ /arvados\.errors\.Keep/) { 
       $jobstep[$job]->{tempfail} = 1; 
     } 
 </pre> 

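This won't match an error emitted by FUSE, because the traceback logs the exception as a bare KeepReadError, without the arvados.errors. prefix the pattern requires: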

 <pre> 
 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr 2016-02-04 12:55:10 arvados.arvados_fuse[6520] ERROR: Unhandled exception during FUSE operation 
 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr Traceback (most recent call last): 
 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr     File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/__init__.py", line 277, in catch_exceptions_wrapper 
 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr       return orig_func(self, *args, **kwargs) 
 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr     File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/__init__.py", line 521, in read 
 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr       r = handle.obj.readfrom(off, size, self.num_retries) 
 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr     File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/fusefile.py", line 55, in readfrom 
 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr       return self.arvfile.readfrom(off, size, num_retries, exact=True) 
 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr     File "/usr/local/lib/python2.7/dist-packages/arvados/arvfile.py", line 828, in readfrom 
 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr       block = self.parent._my_block_manager().get_block_contents(lr.locator, num_retries=num_retries, cache_only=(bool(data) and not exact)) 
 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr     File "/usr/local/lib/python2.7/dist-packages/arvados/arvfile.py", line 614, in get_block_contents 
 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr       return self._keep.get(locator, num_retries=num_retries) 
 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr     File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 153, in num_retries_setter 
 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr       return orig_func(self, *args, **kwargs) 
 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr     File "/usr/local/lib/python2.7/dist-packages/arvados/keep.py", line 980, in get 
 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr       "failed to read {}".format(loc_s), service_errors, label="service") 
 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr KeepReadError: failed to read 32527d1c5562cad5afa6e119f99c0cdf+67108864+A3f7385e4546387908af0374653f90c59e6dd908c@56c56ff4: service http://keep2.wx7k5.arvadosapi.com:25107/ responded with 0 (7, 'Failed to connect to keep2.wx7k5.arvadosapi.com port 25107: Connection refused'); service http://keep3.wx7k5.arvadosapi.com:25107/ responded with 0 (7, 'Failed to connect to keep3.wx7k5.arvadosapi.com port 25107: Connection refused'); service http://keep1.wx7k5.arvadosapi.com:25107/ responded with 404 HTTP/1.1 404 Not Found\015 
 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr ; service http://keep4.wx7k5.arvadosapi.com:25107/ responded with 404 HTTP/1.1 404 Not Found\015 
 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr ; service http://keep6.wx7k5.arvadosapi.com:25107/ responded with 404 HTTP/1.1 404 Not Found\015 
 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr ; service http://keep5.wx7k5.arvadosapi.com:25107/ responded with 404 HTTP/1.1 404 Not Found\015 
 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr ; service http://keep8.wx7k5.arvadosapi.com:25107/ responded with 404 HTTP/1.1 404 Not Found\015 
 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr ; service http://keep7.wx7k5.arvadosapi.com:25107/ responded with 404 HTTP/1.1 404 Not Found\015 
 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr  
 </pre>
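
The mismatch can be reproduced outside crunch-job. The following is a minimal sketch in Python rather than Perl (the pattern syntax is the same in both), assuming a shortened copy of the final traceback line from the log above. The looser pattern, the `is_tempfail` helper, and the module-qualified sample line are hypothetical and only illustrate the problem; they are not the actual fix.

```python
import re

# Pattern copied from the crunch-job snippet above.
CURRENT_PATTERN = re.compile(r"arvados\.errors\.Keep")

# Hypothetical looser pattern (an assumption for illustration, not the real fix):
# matches any exception name of the form Keep...Error, with or without a module prefix.
LOOSER_PATTERN = re.compile(r"\bKeep\w*Error\b")

# Final traceback line from the FUSE log above, shortened after the block hash and size.
fuse_line = (
    "2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr "
    "KeepReadError: failed to read 32527d1c5562cad5afa6e119f99c0cdf+67108864"
)

# Made-up line showing the module-qualified form that the current pattern expects.
qualified_line = "stderr arvados.errors.KeepReadError: failed to read"


def is_tempfail(pattern, line):
    """Return True if the log line would be flagged as a temporary failure."""
    return pattern.search(line) is not None


if __name__ == "__main__":
    for name, pattern in (("current", CURRENT_PATTERN), ("looser", LOOSER_PATTERN)):
        print(name, "FUSE line:", is_tempfail(pattern, fuse_line))
        print(name, "qualified line:", is_tempfail(pattern, qualified_line))
```

A pattern this loose would also flag any unrelated stderr output that happens to contain a word such as KeepReadError, so whether that trade-off is acceptable is a judgment call for crunch-job.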