Bug #8148

Updated by Brett Smith about 8 years ago

Original bug: https://workbench.wx7k5.arvadosapi.com/jobs/wx7k5-8i9sb-nn25na7iqrt6hnf failed like this: 

 <pre> 
 2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr 2016-01-07 05:51:41 arvados.arvados_fuse[10518] ERROR: Keep write error: Error writing some blocks: block cc0082511a746b57deb1a2e34a68ccfb+67104768 raised KeepWriteError (failed to write cc0082511a746b57deb1a2e34a68ccfb (wanted 2 copies but wrote 1): service http://keep4.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015 
 2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr     HTTP/1.1 503 Service Unavailable; service http://keep1.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015 
 2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr     HTTP/1.1 503 Service Unavailable; service http://keep2.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015 
 2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr     HTTP/1.1 503 Service Unavailable); block c8191a209f72c333a42b895ce9410985+67039232 raised KeepWriteError (failed to write c8191a209f72c333a42b895ce9410985 (wanted 2 copies but wrote 1): service http://keep2.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015 
 2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr     HTTP/1.1 503 Service Unavailable; service http://keep4.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015 
 2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr     HTTP/1.1 503 Service Unavailable; service http://keep1.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015 
 2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr     HTTP/1.1 503 Service Unavailable); block f51d5c298bad31cdc1bcb59db43a7a98+67039232 raised KeepWriteError (failed to write f51d5c298bad31cdc1bcb59db43a7a98 (wanted 2 copies but wrote 1): service http://keep1.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015 
 2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr     HTTP/1.1 503 Service Unavailable; service http://keep4.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015 
 2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr     HTTP/1.1 503 Service Unavailable; service http://keep2.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015 
 2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr     HTTP/1.1 503 Service Unavailable); block 220ddadcf23d10bc4c82c45c3d148223+67100672 raised KeepWriteError (failed to write 220ddadcf23d10bc4c82c45c3d148223 (wanted 2 copies but wrote 1): service http://keep4.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015 
 2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr     HTTP/1.1 503 Service Unavailable; service http://keep2.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015 
 2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr     HTTP/1.1 503 Service Unavailable; service http://keep1.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015 
 2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr     HTTP/1.1 503 Service Unavailable); block 94d242dc2c3380a005fb042e8e7835bf+67039232 raised KeepWriteError (failed to write 94d242dc2c3380a005fb042e8e7835bf (wanted 2 copies but wrote 1): service http://keep4.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015 
 2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr     HTTP/1.1 503 Service Unavailable; service http://keep1.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015 
 2016-01-07_05:51:42 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr     HTTP/1.1 503 Service Unavailable; service http://keep2.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015 
 2016-01-07_05:51:42 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr     HTTP/1.1 503 Service Unavailable); block 94a3a5a4299ce918ee879bb384763bd6+67039232 raised KeepWriteError (failed to write 94a3a5a4299ce918ee879bb384763bd6 (wanted 2 copies but wrote 1): service http://keep1.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015 
 2016-01-07_05:51:42 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr     HTTP/1.1 503 Service Unavailable; service http://keep4.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015 
 </pre> 

 Diagnosis: At the time the job started, there were four Keep disk services, and three of them were full (or close enough that they couldn't hold the job's output data as it wrote to FUSE). As the job wrote new blocks to Keep, block writes started failing because it was impossible to write two copies to Keep disks. Fair enough.

 While the job and mount were running, sysadmins added a fifth Keep service with plenty of space. This made it harder to diagnose what happened: there appeared to be enough space on the cluster. It would've been good if arv-mount kept retrying to upload these blocks, including refreshing the Keep services list, during its life. If it had, it could've uploaded these blocks to the new Keep service and succeeded in uploading the collection.

 When FUSE gives up trying to write a block, the next I/O operation on the file should fail, to avoid situations like this where there's a substantial disconnect between the time of the problem and the time of the report.
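
 A minimal sketch of the two behaviors suggested here, in Python since arv-mount is written in it. This is not the Arvados SDK: @put_with_refresh@, @FailFastFile@, @WriteFailed@, @list_services@ and @put_to_service@ are all hypothetical names, and the two callables are assumed stand-ins for whatever the real Keep client does. It only illustrates re-reading the service list on every retry round, and remembering a failed block write so the next write on the file reports it.

```python
import time


class WriteFailed(Exception):
    """Raised when a block could not be stored with enough copies."""


def put_with_refresh(block, want_copies, list_services, put_to_service,
                     max_rounds=5, delay=0.0):
    """Try to store `block` on `want_copies` distinct services.

    list_services() and put_to_service(service, block) are hypothetical
    callables standing in for a Keep client; they are not Arvados APIs.
    """
    done = set()  # services that already hold a copy
    for _ in range(max_rounds):
        # Re-read the service list each round so newly added services are seen.
        for service in list_services():
            if len(done) >= want_copies:
                break
            if service in done:
                continue
            if put_to_service(service, block):
                done.add(service)
        if len(done) >= want_copies:
            return len(done)
        time.sleep(delay)
    raise WriteFailed(
        "wanted %d copies but wrote %d" % (want_copies, len(done)))


class FailFastFile:
    """File-like wrapper that fails the next write after a block write gives up."""

    def __init__(self, put_block):
        self._put_block = put_block  # callable(block); may raise WriteFailed
        self._error = None           # sticky error from an earlier write

    def write(self, block):
        if self._error is not None:
            # Report the earlier failure now, not at the end of the job.
            raise IOError("earlier block write failed: %s" % self._error)
        try:
            self._put_block(block)
        except WriteFailed as exc:
            self._error = exc
        return len(block)
```

 In this sketch the write that triggers the failure still returns normally, as a buffered write would, and the error surfaces on the following call. A real implementation would also need to decide how long to keep retrying and how to report the error on flush and close.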
