Bug #20637

Updated by Peter Amstutz 11 months ago

A user's container workflow is getting FUSE errors. The underlying error message in arv-mount.txt is "Failed to connect to 172.17.0.1 port 36323: Connection refused".

In addition, in crunch-run.txt, crunch-run is reporting "error updating log collection: error recording logs: Could not write sufficient replicas ... dial tcp 172.17.0.1:36323 conne" (presumably "connection refused", but the message is truncated).

This is with a keepstore running locally on the compute node. The keepstore service must have been working initially, because it was able to load the docker image and write the initial log collection snapshot. Subsequently, it has been unable to update the log collection, failing with the error above.

This suggests the keepstore service crashed. startLocalKeepstore uses the health check to determine when the service has started, but does not set up an ongoing watchdog to ensure the service continues to be available.
