Bug #10585

closed

crunch doesn't end jobs when their arv-mount dies

Added by Joshua Randall over 7 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Story points: 0.0

Description

We have slowly been accumulating a large number of jobs across our cluster that are still "running" although they are not doing anything. This appears to be because their corresponding arv-mount process has died for some reason: if one does a `docker exec` to get into the crunch container, `ls /keep` reports "Transport endpoint is not connected":

crunch@4e50cdfd50db:/tmp/crunch-job$ ls /keep
ls: cannot access /keep: Transport endpoint is not connected

My expectation would be that if arv-mount dies, the crunch container should be destroyed so that the resources can be freed up.
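
To illustrate the expected behaviour, here is a minimal sketch of a watchdog that ends a child process once its mount point stops responding. It is not how crunch, crunchstat, or arv-mount are implemented; the function names `mount_is_connected` and `run_with_mount_watchdog`, the polling interval, and the choice of SIGTERM are all assumptions made for the example. The only fact taken from this report is that a dead arv-mount leaves the mount point returning "Transport endpoint is not connected" (ENOTCONN).

```python
import errno
import os
import subprocess
import time


def mount_is_connected(path):
    """Return False if stat() on path fails with ENOTCONN (dead FUSE mount).

    Any other OSError (for example, a missing path) is re-raised so that
    it is not silently mistaken for a healthy or a dead mount.
    """
    try:
        os.stat(path)
    except OSError as e:
        if e.errno == errno.ENOTCONN:
            return False
        raise
    return True


def run_with_mount_watchdog(cmd, mount_path, interval=1.0):
    """Run cmd and terminate it if mount_path becomes disconnected.

    Hypothetical helper: polls the mount point every `interval` seconds
    while the child is running, and returns the child's exit status.
    """
    child = subprocess.Popen(cmd)
    try:
        while child.poll() is None:
            if not mount_is_connected(mount_path):
                # The mount is gone; stop the job so resources are freed.
                child.terminate()
                break
            time.sleep(interval)
    finally:
        # Never leave the child running if the watchdog itself fails.
        if child.poll() is None:
            child.terminate()
        child.wait()
    return child.returncode
```

A call such as `run_with_mount_watchdog(["my-job"], "/keep")` (with `my-job` standing in for the real job command) would return the job's own exit status if it finishes normally, or a negative signal number if the watchdog had to terminate it.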


Subtasks 3 (0 open, 3 closed)

Task #10755: Add test (Resolved, Tom Clegg, 11/22/2016)
Task #10725: crunchstat option to kill child when parent dies (Resolved, Tom Clegg, 11/22/2016)
Task #10730: Review 10585-crunchstat-lost-parent (Resolved, Lucas Di Pentima, 11/22/2016)

Related issues

Related to Arvados - Bug #10586: Python keep client (CollectionWriter) appears to deadlock (Resolved, Tom Clegg, 11/22/2016)
Related to Arvados - Bug #10777: [Crunch2] crunch-run: stop the container and fail if arv-mount dies before the container finishes (Resolved, Tom Clegg, 02/24/2017)