Bug #5956

closed

[Deployment] Docker configuration changes restart Docker on compute nodes, interrupting running jobs

Added by Sarah Guthrie almost 9 years ago. Updated over 7 years ago.

Status: Duplicate
Priority: High
Assigned To: -
Category: Deployment
Target version: -
Story points: -

Description

Two types of errors are causing long-running pipeline jobs to fail, and the jobs fail on different inputs. The priority here is high because the outputs of this pipeline are needed for the paper.

The first failed on the 121st file: data_HG01927_cg_data_ASM_blood_var-GS000013202-ASM.fj (no such file error)
The second failed at the very beginning of the job (no such file error)
The third failed on the 79th file: data_HG00663_cg_data_ASM_lcl_var-GS000016983-ASM.fj (no such file error)

Type 1:

2015-05-07_22:28:38 su92l-8i9sb-d3zljjdswotmscp 13424 602 stderr time="2015-05-07T22:28:38Z" level=fatal msg="Post http:///var/run/docker.sock/v1.18/containers/6cac91fd0d4f232ff19350b1017598fa1a9ee381e414909a6b1d3eb82384d7a9/wait: write unix /var/run/docker.sock: broken pipe. Are you trying to connect to a TLS-enabled daemon without TLS?"

Type 2:

2015-05-07_22:35:39 su92l-8i9sb-d3zljjdswotmscp 13424 599 stderr time="2015-05-07T22:35:39Z" level=fatal msg="Post http:///var/run/docker.sock/v1.18/containers/97d5c61fc5fc9ae76f06db990f95c5df479658d69cfc6737387c98d5ee0957ac/wait: dial unix /var/run/docker.sock: no such file or directory. Are you trying to connect to a TLS-enabled daemon without TLS?"

The pipeline instance this failed on is:

https://workbench.su92l.arvadosapi.com/pipeline_instances/su92l-d1hrv-kqz04a5cri08gsc
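
For triage, the two error types above can be told apart from the message text alone. The following is a minimal sketch, not part of Arvados: the function name and the returned labels are hypothetical, and it only assumes the log lines contain the same phrases as the two samples quoted in this ticket.

```python
def classify_docker_socket_error(log_line):
    """Return a coarse label for a Docker client error line.

    The labels are hypothetical and only mirror the two error types
    quoted in this ticket.
    """
    if "broken pipe" in log_line:
        # Type 1: the client had a connection to /var/run/docker.sock,
        # but the write failed because the other end had closed it.
        return "type1_broken_pipe"
    if "no such file or directory" in log_line:
        # Type 2: the socket path did not exist when the client tried
        # to connect ("dial unix ...").
        return "type2_socket_missing"
    return "other"


if __name__ == "__main__":
    sample = "wait: dial unix /var/run/docker.sock: no such file or directory."
    print(classify_docker_socket_error(sample))
```

Counting the labels across a job's stderr log would show how many tasks hit each failure mode.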


Related issues

Related to Arvados - Bug #7481: Docker Daemon failure or FUSE problem (Duplicate, 10/08/2015)
Has duplicate Arvados - Bug #5959: Failed Jobs on 5/7 (Docker issues?) (Closed, 05/08/2015)
Actions #1

Updated by Brett Smith almost 9 years ago

  • Status changed from New to Closed
Actions #2

Updated by Brett Smith almost 9 years ago

  • Subject changed from "/var/run/docker.sock not found/broken pipe causing tasks to fail in long-running jobs" to "[Deployment] Docker configuration changes restart Docker on compute nodes, interrupting running jobs"
  • Category set to Deployment
  • Status changed from Closed to New

This happened because of a Docker configuration change (to fix a bug reported by another user). Right now our configuration management systems check for changes once an hour, and restart Docker as soon as there's any change. This interrupts running jobs, so we need to figure out a smoother process to handle these changes.

The good news is that we don't currently have any other Docker changes on the radar, so you should be able to continue your work without worrying about this.
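
One possible shape for a smoother process is to postpone the restart until the node has no running containers. The sketch below is only an illustration under stated assumptions, not how our configuration management actually works: the function names are hypothetical, and it assumes the `docker ps -q` command (which prints one running container ID per line) is available on the compute node.

```python
import subprocess


def running_container_ids(docker_ps_output):
    """Parse `docker ps -q` output into a list of container IDs."""
    return [line.strip() for line in docker_ps_output.splitlines() if line.strip()]


def should_restart_docker(config_changed, docker_ps_output):
    """Restart only if the config changed AND no containers are running.

    Hypothetical policy: a pending change simply waits for the next
    hourly check when the node is busy.
    """
    return config_changed and not running_container_ids(docker_ps_output)


def current_docker_ps_output():
    """Ask the local Docker daemon for the IDs of running containers."""
    result = subprocess.run(
        ["docker", "ps", "-q"], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    # Assumes the caller already knows the configuration has changed.
    if should_restart_docker(True, current_docker_ps_output()):
        print("node idle: safe to restart Docker")
    else:
        print("containers running: deferring restart")
```

A node that is never idle would never pick up the change under this policy, so a real process would also need a way to drain the node or stop scheduling new jobs on it.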

Actions #3

Updated by Radhika Chippada almost 9 years ago

  • Target version changed from Bug Triage to Arvados Future Sprints
Actions #4

Updated by Brett Smith almost 9 years ago

  • Status changed from New to Duplicate

Duplicates #4300.

Actions #5

Updated by Tom Morris over 7 years ago

  • Target version deleted (Arvados Future Sprints)