Feature #13403

Updated by Peter Amstutz almost 6 years ago

https://dev.arvados.org/issues/13377#note-4 

 > The fact that we kept the container alive for 7 hours retrying seems like a problem, too. In the context of a container, if arv-mount gives up on a fuse request and returns an error analogous to "filesystem is corrupt / disk is dead" to the caller, should we automatically fail the container? 

 If arv-mount has a block read error (or really anything that FUSE will turn into EIO), crunch-run should cancel the container (the container request may still retry by existing logic, though). 

 Design question: How does crunch-run detect when a major file system error (EIO) happens? Log sniffing?

 Proposed design: 

 * arv-mount emits a well-known error string when it returns a major file system error (EIO)
 * crunch-run monitors arv-mount output and looks for the error string
 * on seeing the error from arv-mount, crunch-run logs an error of its own and cancels the container.
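
 The log-sniffing approach could be sketched roughly as follows. This is only an illustration in Python, not crunch-run's actual code: the marker string `ARV_MOUNT_FATAL_IO_ERROR`, the function `watch_mount_output`, and the `cancel` callback are all hypothetical names, since the ticket does not say what the well-known error string would be or how cancellation is triggered.

```python
import io
from typing import Callable, Iterable

# Hypothetical marker: the ticket does not specify the actual error string.
FATAL_MARKER = "ARV_MOUNT_FATAL_IO_ERROR"


def watch_mount_output(
    lines: Iterable[str],
    cancel: Callable[[str], None],
    marker: str = FATAL_MARKER,
) -> bool:
    """Scan mount output line by line.

    On the first line containing the marker, call cancel() with that
    line (trailing newline removed) and stop reading. Returns True if
    the marker was seen, False if the output ended without it.
    """
    for line in lines:
        if marker in line:
            cancel(line.rstrip("\n"))
            return True
    return False


if __name__ == "__main__":
    # Stand-in for the mount process's stderr stream.
    sample = io.StringIO(
        "mounting collection\n"
        f"{FATAL_MARKER}: block read failed\n"
        "further output\n"
    )
    reasons = []
    # A real monitor would log its own error and cancel the container here.
    watch_mount_output(sample, reasons.append)
    print(reasons)
```

 In a real monitor the `lines` argument would be the mount process's output stream, and the callback would log an error and cancel the container rather than append to a list.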