Bug #5956
closed
[Deployment] Docker configuration changes restart Docker on compute nodes, interrupting running jobs
Description
Two types of errors are causing long-running pipeline jobs to fail. The jobs fail on different inputs. This is high priority because the outputs of this pipeline are needed for the paper.
The first failed on the 121st file: data_HG01927_cg_data_ASM_blood_var-GS000013202-ASM.fj (no such file error)
The second failed at the very beginning of the job (no such file error)
The third failed on the 79th file: data_HG00663_cg_data_ASM_lcl_var-GS000016983-ASM.fj (no such file error)
Type 1:
2015-05-07_22:28:38 su92l-8i9sb-d3zljjdswotmscp 13424 602 stderr time="2015-05-07T22:28:38Z" level=fatal msg="Post http:///var/run/docker.sock/v1.18/containers/6cac91fd0d4f232ff19350b1017598fa1a9ee381e414909a6b1d3eb82384d7a9/wait: write unix /var/run/docker.sock: broken pipe. Are you trying to connect to a TLS-enabled daemon without TLS?"
Type 2:
2015-05-07_22:35:39 su92l-8i9sb-d3zljjdswotmscp 13424 599 stderr time="2015-05-07T22:35:39Z" level=fatal msg="Post http:///var/run/docker.sock/v1.18/containers/97d5c61fc5fc9ae76f06db990f95c5df479658d69cfc6737387c98d5ee0957ac/wait: dial unix /var/run/docker.sock: no such file or directory. Are you trying to connect to a TLS-enabled daemon without TLS?"
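For anyone triaging similar failures, the two messages above can be told apart mechanically. The following is a minimal sketch, not part of Arvados: the labels and the function name `classify_docker_error` are hypothetical, and the patterns are taken only from the two log lines quoted in this ticket.

```python
import re

# Hypothetical labels; the ticket itself only calls these "Type 1" and "Type 2".
# Patterns are derived solely from the two stderr lines quoted above.
PATTERNS = {
    "type1_broken_pipe": re.compile(
        r"write unix \S*docker\.sock: broken pipe"
    ),
    "type2_socket_missing": re.compile(
        r"dial unix \S*docker\.sock: no such file or directory"
    ),
}


def classify_docker_error(line):
    """Return the label of the first matching pattern, or None if no match."""
    for label, pattern in PATTERNS.items():
        if pattern.search(line):
            return label
    return None
```

Type 1 (broken pipe) suggests the daemon went away while a client was already connected. Type 2 (socket missing) suggests the client tried to connect while the daemon was down.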
The pipeline instance this failed on is:
https://workbench.su92l.arvadosapi.com/pipeline_instances/su92l-d1hrv-kqz04a5cri08gsc
Updated by Brett Smith almost 9 years ago
- Subject changed from /var/run/docker.sock not found/broken pipe causing tasks to fail in long-running jobs to [Deployment] Docker configuration changes restart Docker on compute nodes, interrupting running jobs
- Category set to Deployment
- Status changed from Closed to New
This happened because of a Docker configuration change (to fix a bug reported by another user). Right now our configuration management systems check for changes once an hour, and restart Docker as soon as there's any change. This interrupts running jobs, so we need to figure out a smoother process to handle these changes.
The good news is that we don't currently have any other Docker changes on the radar, so you should be able to continue your work without worrying about this.
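One possible shape for a smoother process is to defer the restart until no containers are running on the node. The sketch below is an illustration under assumptions, not the fix adopted for this ticket: the function names and the `restart` callback are hypothetical, and the only real command used is `docker ps -q`, which prints the IDs of running containers.

```python
import subprocess


def running_container_ids():
    """Return IDs of running containers, as printed by `docker ps -q`."""
    result = subprocess.run(
        ["docker", "ps", "-q"], check=True, capture_output=True, text=True
    )
    return result.stdout.split()


def should_restart_docker(config_changed, running_ids):
    """Restart only when a change is pending and no containers are running."""
    return bool(config_changed) and not running_ids


def apply_pending_change(config_changed, restart, list_running=running_container_ids):
    """Call restart() if it is safe to do so; return True if it was called.

    `restart` is a hypothetical callback supplied by the caller, for example
    whatever the configuration management system uses to restart Docker.
    """
    if not config_changed:
        return False
    if should_restart_docker(config_changed, list_running()):
        restart()
        return True
    # Containers are still running: leave the change pending for the next check.
    return False
```

With an hourly check, a pending change would then be applied at the first check that finds the node idle. A busy node could postpone the change indefinitely, so a real process would likely also need to stop scheduling new jobs onto the node.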
Updated by Radhika Chippada almost 9 years ago
- Target version changed from Bug Triage to Arvados Future Sprints
Updated by Tom Morris over 7 years ago
- Target version deleted (Arvados Future Sprints)