Bug #8148

[FUSE] When we give up trying to write a block, the next operation on the file should fail

Added by Bryan Cosca over 3 years ago. Updated over 3 years ago.

Status: New
Priority: Normal
Assigned To: -
Category: -
Target version:
Start date: 01/07/2016
Due date:
% Done: 0%
Estimated time:
Story points: -

Description

Original bug: the job https://workbench.wx7k5.arvadosapi.com/jobs/wx7k5-8i9sb-nn25na7iqrt6hnf failed like this:

2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr 2016-01-07 05:51:41 arvados.arvados_fuse[10518] ERROR: Keep write error: Error writing some blocks: block cc0082511a746b57deb1a2e34a68ccfb+67104768 raised KeepWriteError (failed to write cc0082511a746b57deb1a2e34a68ccfb (wanted 2 copies but wrote 1): service http://keep4.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable; service http://keep1.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable; service http://keep2.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable); block c8191a209f72c333a42b895ce9410985+67039232 raised KeepWriteError (failed to write c8191a209f72c333a42b895ce9410985 (wanted 2 copies but wrote 1): service http://keep2.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable; service http://keep4.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable; service http://keep1.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable); block f51d5c298bad31cdc1bcb59db43a7a98+67039232 raised KeepWriteError (failed to write f51d5c298bad31cdc1bcb59db43a7a98 (wanted 2 copies but wrote 1): service http://keep1.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable; service http://keep4.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable; service http://keep2.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable); block 220ddadcf23d10bc4c82c45c3d148223+67100672 raised KeepWriteError (failed to write 220ddadcf23d10bc4c82c45c3d148223 (wanted 2 copies but wrote 1): service http://keep4.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable; service http://keep2.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable; service http://keep1.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable); block 94d242dc2c3380a005fb042e8e7835bf+67039232 raised KeepWriteError (failed to write 94d242dc2c3380a005fb042e8e7835bf (wanted 2 copies but wrote 1): service http://keep4.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:41 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable; service http://keep1.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:42 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable; service http://keep2.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:42 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable); block 94a3a5a4299ce918ee879bb384763bd6+67039232 raised KeepWriteError (failed to write 94a3a5a4299ce918ee879bb384763bd6 (wanted 2 copies but wrote 1): service http://keep1.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015
2016-01-07_05:51:42 wx7k5-8i9sb-nn25na7iqrt6hnf 9798 2 stderr   HTTP/1.1 503 Service Unavailable; service http://keep4.wx7k5.arvadosapi.com:25107/ responded with 503 HTTP/1.1 100 Continue\015

Diagnosis: At the time the job started, there were four Keep disk services, and three of them were full (or close enough to full that they couldn't hold the job's output data as it was written through FUSE). As the job wrote new blocks to Keep, block writes started failing because it was impossible to write two copies to Keep disks. Fair enough.

While the job and mount were running, sysadmins added a fifth Keep service with plenty of space. This made it harder to diagnose what happened: there appeared to be enough space on the cluster.

When FUSE gives up trying to write a block, the next I/O operation on the file should fail, to avoid situations like this where there's a substantial disconnect between the time of the problem and the time of the report.
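
To illustrate the requested behavior, here is a minimal sketch in Python. It is not the arvados_fuse implementation: `StickyErrorFile`, `BlockWriteError`, `put_block`, and `flush_blocks` are hypothetical names made up for this example. The idea is that once a background block write is abandoned, the file remembers the failure and every later operation raises EIO instead of silently succeeding.

```python
import errno


class BlockWriteError(Exception):
    """Hypothetical stand-in for the error raised when a block write is abandoned."""


class StickyErrorFile:
    """Hypothetical file wrapper: after one abandoned block write,
    every later operation on the file fails with EIO."""

    def __init__(self, put_block):
        self._put_block = put_block  # callable(bytes); may raise BlockWriteError
        self._pending = []           # blocks buffered but not yet stored
        self._failure = None         # first abandoned write, if any

    def _check(self):
        # Called at the start of every file operation.
        if self._failure is not None:
            raise OSError(errno.EIO,
                          "earlier block write failed: %s" % self._failure)

    def write(self, data):
        self._check()
        self._pending.append(bytes(data))
        return len(data)

    def flush_blocks(self):
        """Stands in for the asynchronous flush: records a failure, does not raise."""
        while self._pending:
            block = self._pending.pop(0)
            try:
                self._put_block(block)
            except BlockWriteError as e:
                self._failure = e
                return

    def close(self):
        self.flush_blocks()
        self._check()
```

With this shape, a caller whose earlier write was abandoned gets an I/O error on its next `write()` or `close()`, close in time to the actual problem, rather than only a log message at the end of the job.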


Related issues

Related to Arvados - Story #8539: [SDKs/FUSE] Better retry defaults (New)

History

#1 Updated by Bryan Cosca over 3 years ago

  • Description updated (diff)

#2 Updated by Brett Smith over 3 years ago

These services are reporting 503 because they're full. We recently added some space by adding a keep5.wx7k5, on the 6th. So the question is, why didn't this job see that?

#3 Updated by Brett Smith over 3 years ago

The job, and therefore its FUSE mount, started before we added keep5.wx7k5. Writing these blocks likely failed because it tried to write them before keep5.wx7k5 was available.

Bryan, this job is safe to restart immediately. The bug here is probably that FUSE needs to refresh the Keep service list and retry with new services to account for cases like this.
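
The refresh-and-retry idea could look roughly like the following Python sketch. It is only an illustration under assumptions: `put_with_refresh`, `list_services`, `put_to`, and `WriteFailed` are hypothetical names, not part of the Arvados SDK. The point is that the service list is fetched again on every attempt, so a service added after the process started can still receive the missing copy.

```python
class WriteFailed(Exception):
    """Hypothetical error raised when too few copies could be stored."""


def put_with_refresh(block, list_services, put_to, copies=2, attempts=3):
    """Store `block` on `copies` distinct services.

    list_services() returns the current service names (hypothetical).
    put_to(service, block) returns True if the service stored the block.
    """
    stored = set()
    for _ in range(attempts):
        # Re-fetch the list each attempt instead of caching it at startup.
        for service in list_services():
            if service in stored:
                continue
            if put_to(service, block):
                stored.add(service)
            if len(stored) >= copies:
                return stored
    raise WriteFailed("wanted %d copies but wrote %d" % (copies, len(stored)))
```

A real client would also wait between attempts; the delay is left out here to keep the sketch short.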

#4 Updated by Brett Smith over 3 years ago

  • Subject changed from keep1 keep2 keep3 503 error to [FUSE] When writing a block fails, keep retrying over the life of the process
  • Description updated (diff)

#5 Updated by Brett Smith over 3 years ago

  • Target version set to Arvados Future Sprints

#6 Updated by Brett Smith over 3 years ago

  • Subject changed from [FUSE] When writing a block fails, keep retrying over the life of the process to [FUSE] When we give up trying to write a block, the next operation on the file should fail
  • Description updated (diff)
