Bug #8148

[FUSE] When we give up trying to write a block, the next operation on the file should fail

Added by Bryan Cosca over 8 years ago. Updated 2 months ago.

Status: New
Priority: Normal
Assigned To: -
Category: -
Target version:
Story points: -
Release:
Release relationship: Auto

Description

Original bug: the job https://workbench.wx7k5.arvadosapi.com/jobs/wx7k5-8i9sb-nn25na7iqrt6hnf failed like this:

2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr 2016-01-07 05:51:41 arvados.arvados_fuse[10518] ERROR: Keep write error: Error writing some blocks: block cc0082511a746b57deb1a2e34a68ccfb+67104768 raised KeepWriteError (failed to write cc0082511a746b57deb1a2e34a68ccfb (wanted 2 copies but wrote 1): service http://keep4.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable; service http://keep1.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable; service http://keep2.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable); block c8191a209f72c333a42b895ce9410985+67039232 raised KeepWriteError (failed to write c8191a209f72c333a42b895ce9410985 (wanted 2 copies but wrote 1): service http://keep2.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable; service http://keep4.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable; service http://keep1.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable); block f51d5c298bad31cdc1bcb59db43a7a98+67039232 raised KeepWriteError (failed to write f51d5c298bad31cdc1bcb59db43a7a98 (wanted 2 copies but wrote 1): service http://keep1.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable; service http://keep4.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable; service http://keep2.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable); block 220ddadcf23d10bc4c82c45c3d148223+67100672 raised KeepWriteError (failed to write 220ddadcf23d10bc4c82c45c3d148223 (wanted 2 copies but wrote 1): service http://keep4.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable; service http://keep2.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable; service http://keep1.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable); block 94d242dc2c3380a005fb042e8e7835bf+67039232 raised KeepWriteError (failed to write 94d242dc2c3380a005fb042e8e7835bf (wanted 2 copies but wrote 1): service http://keep4.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable; service http://keep1.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:42 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable; service http://keep2.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:42 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable); block 94a3a5a4299ce918ee879bb384763bd6+67039232 raised KeepWriteError (failed to write 94a3a5a4299ce918ee879bb384763bd6 (wanted 2 copies but wrote 1): service http://keep1.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:42 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable; service http://keep4.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015

Diagnosis: At the time the job started, there were four Keep disk services, and three of them were full (or close enough to full that they couldn't hold the job's output data as it was written to the FUSE mount). As the job wrote new blocks to Keep, block writes started failing because it was impossible to store two copies on the Keep disks. Fair enough.
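
The "wanted 2 copies but wrote 1" text in the log comes from this kind of replica accounting in the Keep client. Roughly (a simplified sketch, not the real arvados Python SDK code; the service objects and the put_block function here are hypothetical):

class KeepWriteError(Exception):
    pass

def put_block(services, data, want_copies=2):
    # Simplified sketch: try each Keep service until enough copies are stored.
    copies = 0
    errors = []
    for svc in services:
        try:
            svc.put(data)           # hypothetical per-service PUT
            copies += 1
        except Exception as exc:    # e.g. HTTP 503 from a full keepstore
            errors.append(exc)
        if copies >= want_copies:
            return
    raise KeepWriteError(
        "failed to write block (wanted %d copies but wrote %d): %s"
        % (want_copies, copies, "; ".join(str(e) for e in errors)))

With only one service that still had free space, the copy count never reaches 2, so every block write ends in KeepWriteError even though one copy of the data was stored.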

While the job and mount were still running, sysadmins added a fifth Keep service with plenty of space. This made it harder to diagnose what had happened: by the time we investigated, there appeared to be enough space on the cluster.

When FUSE gives up trying to write a block, the next I/O operation on the file should fail, to avoid situations like this where there's a substantial disconnect between the time of the problem and the time of the report.
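
A minimal sketch of the suggested behaviour (not the actual arvados_fuse implementation; the BufferedFile class and its attributes are hypothetical, but llfuse.FUSEError is how an llfuse-based filesystem reports an errno to the kernel):

import errno
import llfuse

class BufferedFile:
    # Hypothetical FUSE file handle that remembers a failed background
    # block write and fails the next I/O operation with EIO.

    def __init__(self):
        self.write_error = None   # set by the background block uploader

    def note_write_error(self, exc):
        # Called by the block-writing code when it gives up on a block.
        self.write_error = exc

    def writeto(self, offset, data):
        if self.write_error is not None:
            # Surface the earlier Keep failure to the caller right away,
            # instead of only logging it.
            raise llfuse.FUSEError(errno.EIO)
        # ... buffer the data and schedule a block write as usual ...

    def flush(self):
        if self.write_error is not None:
            raise llfuse.FUSEError(errno.EIO)
        # ... flush remaining buffers ...

Failing the next write (or flush) with EIO ties the error report to the moment the block write actually failed, so the 503s from the full keepstores would have surfaced while three of the four services were still full.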


Related issues

Related to Arvados - Idea #8539: [SDKs/FUSE] Better retry defaults (Resolved)
