Bug #5956
closed [Deployment] Docker configuration changes restart Docker on compute nodes, interrupting running jobs
Description
Two types of errors are causing long-running pipeline jobs to fail. The jobs fail on different inputs. The priority is high because the outputs of this pipeline are needed for the paper.
The first job failed on the 121st file: data_HG01927_cg_data_ASM_blood_var-GS000013202-ASM.fj (no such file error)
The second job failed at the very beginning of the job (no such file error)
The third job failed on the 79th file: data_HG00663_cg_data_ASM_lcl_var-GS000016983-ASM.fj (no such file error)
Type 1:
2015-05-07_22:28:38 su92l-8i9sb-d3zljjdswotmscp 13424 602 stderr time="2015-05-07T22:28:38Z" level=fatal msg="Post http:///var/run/docker.sock/v1.18/containers/6cac91fd0d4f232ff19350b1017598fa1a9ee381e414909a6b1d3eb82384d7a9/wait: write unix /var/run/docker.sock: broken pipe. Are you trying to connect to a TLS-enabled daemon without TLS?"
Type 2:
2015-05-07_22:35:39 su92l-8i9sb-d3zljjdswotmscp 13424 599 stderr time="2015-05-07T22:35:39Z" level=fatal msg="Post http:///var/run/docker.sock/v1.18/containers/97d5c61fc5fc9ae76f06db990f95c5df479658d69cfc6737387c98d5ee0957ac/wait: dial unix /var/run/docker.sock: no such file or directory. Are you trying to connect to a TLS-enabled daemon without TLS?"
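For anyone scanning job logs for further occurrences, here is a minimal sketch of a classifier that separates the two error types. The function name and labels are hypothetical, and the patterns are taken only from the two log lines quoted above, so other Docker versions may word these messages differently.

```python
import re

# Hypothetical labels. The patterns are derived only from the two stderr
# lines quoted in this ticket, not from any Docker specification.
PATTERNS = {
    "type1_broken_pipe": re.compile(r"write unix \S+: broken pipe"),
    "type2_socket_missing": re.compile(r"dial unix \S+: no such file or directory"),
}


def classify_docker_error(line):
    """Return the label of the first pattern found in a log line, or None."""
    for label, pattern in PATTERNS.items():
        if pattern.search(line):
            return label
    return None


if __name__ == "__main__":
    sample = 'msg="Post http:///var/run/docker.sock/v1.18/containers/abc/wait: write unix /var/run/docker.sock: broken pipe."'
    print(classify_docker_error(sample))
```

Type 1 (write on an existing connection fails) and Type 2 (the socket file cannot be opened at all) would both be consistent with the Docker daemon going away while a job was running, but that reading is an assumption drawn from the ticket title, not something the logs alone prove.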
The pipeline instance this failed on is:
https://workbench.su92l.arvadosapi.com/pipeline_instances/su92l-d1hrv-kqz04a5cri08gsc