Feature #13403
[crunch-run] Cancel container on FUSE error
Start date:
Due date:
% Done:
0%
Estimated time:
Story points:
2.0
Description
https://dev.arvados.org/issues/13377#note-4
The fact that we kept the container alive for 7 hours retrying seems like a problem, too. In the context of a container, if arv-mount gives up on a fuse request and returns an error analogous to "filesystem is corrupt / disk is dead" to the caller, should we automatically fail the container?
If arv-mount has a block read error (or really anything that FUSE will turn into EIO), crunch-run should cancel the container. The container request may still retry under the existing retry logic.
Proposed design:
- arv-mount emits a well-known error string when it returns a major file system error (EIO)
- crunch-run monitors arv-mount output and looks for major file system errors or an out-of-memory error (MemoryError)
- on seeing the error from arv-mount, crunch-run logs an error of its own and cancels the container.
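The proposed design could be sketched roughly as follows. This is only an illustration in Python, not the actual arv-mount or crunch-run code: the marker string `ARV_MOUNT_FATAL_EIO` and the function names `report_if_fatal`, `find_fatal_error`, and `monitor_mount_output` are all hypothetical, since this ticket does not specify the error string or any interface. The `cancel` callback stands in for whatever crunch-run would do to log its own error and cancel the container.

```python
import errno
from typing import Callable, Iterable, Optional

# Hypothetical marker: the ticket only asks for "a well-known error string".
EIO_MARKER = "ARV_MOUNT_FATAL_EIO"
# Strings the monitor treats as fatal: the EIO marker or a Python MemoryError.
FATAL_MARKERS = (EIO_MARKER, "MemoryError")


def report_if_fatal(err: OSError, log: Callable[[str], None]) -> bool:
    """Mount side (assumed): log the marker when an error maps to EIO."""
    if err.errno == errno.EIO:
        log(f"{EIO_MARKER}: {err}")
        return True
    return False


def find_fatal_error(line: str) -> Optional[str]:
    """Return the first fatal marker found in a log line, or None."""
    for marker in FATAL_MARKERS:
        if marker in line:
            return marker
    return None


def monitor_mount_output(
    lines: Iterable[str], cancel: Callable[[str], None]
) -> bool:
    """Monitor side (assumed): scan mount output line by line.

    On the first fatal line, call cancel() once with a reason and stop.
    Returns True if the container was cancelled, False otherwise.
    """
    for line in lines:
        marker = find_fatal_error(line)
        if marker is not None:
            cancel(f"arv-mount reported a fatal error ({marker}): {line.strip()}")
            return True
    return False


if __name__ == "__main__":
    mount_log = []
    # Simulate arv-mount hitting a block read error that surfaces as EIO.
    report_if_fatal(OSError(errno.EIO, "Input/output error"), mount_log.append)

    reasons = []
    cancelled = monitor_mount_output(mount_log, reasons.append)
    print(cancelled, reasons)
```

Stopping at the first fatal line matches the intent of the ticket: the container is cancelled promptly instead of being kept alive while reads are retried.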
History
#1
Updated by Peter Amstutz almost 3 years ago
- Status changed from New to In Progress
#2
Updated by Peter Amstutz almost 3 years ago
- Description updated
#4
Updated by Peter Amstutz almost 3 years ago
- Status changed from In Progress to New
#5
Updated by Peter Amstutz almost 3 years ago
- Description updated
#6
Updated by Tom Clegg almost 3 years ago
- Story points set to 2.0
#7
Updated by Tom Morris almost 3 years ago
- Subject changed from Cancel container on FUSE error to [crunch-run] Cancel container on FUSE error
- Target version changed from To Be Groomed to Arvados Future Sprints
#9
Updated by Peter Amstutz over 2 years ago
- Description updated
#10
Updated by Peter Amstutz over 2 years ago
- Description updated