Bug #13164

closed

[API] dispatch sometimes tries to run cancelled containers at the expense of pending containers

Added by Joshua Randall about 6 years ago. Updated almost 6 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Story points: -
Release:
Release relationship: Auto

Description

Sometimes the crunch-dispatch-slurm logs can be filled with messages such as:

Mar  1 19:54:11 arvados-master-ncucu crunch-dispatch-slurm[26949]: 2018/03/01 19:54:11 debug: error locking container ncucu-dz642-2gpgq1vw3dflmox: arvados API server error: Auth uuid cannot be assigned because priority <= 0 (422: 422 Unprocessable Entity) returned by arvados-api-ncucu.hgi.sanger.ac.uk

This seems to happen after a cwl runner job has been cancelled, or after slurm container jobs have been cancelled manually using `scancel`.

While in this state, crunch-dispatch-slurm seems to ignore most or all of the pending containers (sometimes for hours). However, if it is restarted manually while in that condition (i.e. by running `systemctl restart crunch-dispatch-slurm`), the new process immediately submits all of the pending containers to slurm (sometimes many hundreds in our case).
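
For illustration only, the behaviour one would expect from a dispatcher can be sketched as below. This is a minimal Python sketch and not the actual crunch-dispatch-slurm code; `Container`, `LockError`, `dispatch_pending`, and the `lock`/`submit` callbacks are hypothetical names invented for this example. It assumes that a container with priority <= 0 should never be locked, and that a failed lock (such as the 422 error in the log above) should be skipped rather than allowed to hold up the remaining pending containers.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Container:
    """Hypothetical, simplified view of a container record."""
    uuid: str
    priority: int
    state: str = "Queued"


class LockError(Exception):
    """Stand-in for an API error returned when a lock attempt is rejected."""


def dispatch_pending(
    containers: Iterable[Container],
    lock: Callable[[Container], None],
    submit: Callable[[Container], None],
) -> List[str]:
    """Lock and submit every runnable container, highest priority first.

    Returns the UUIDs that were submitted.
    """
    submitted = []
    for c in sorted(containers, key=lambda c: c.priority, reverse=True):
        # Cancelled containers have priority <= 0: do not try to lock them.
        if c.state != "Queued" or c.priority <= 0:
            continue
        try:
            lock(c)
        except LockError:
            # The container may have been cancelled between listing and
            # locking; move on so the rest of the queue is not starved.
            continue
        submit(c)
        submitted.append(c.uuid)
    return submitted
```

The key design point in this sketch is that a rejected lock is treated as a per-container event: the loop continues to the next pending container instead of retrying the cancelled one.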


Subtasks 3 (0 open, 3 closed)

Task #13459: Review 13164-container-locking | Resolved | Lucas Di Pentima | 05/14/2018
Task #13568: Review 13164-cr-locking | Resolved | Tom Clegg | 05/14/2018
Task #13635: Review 13164-fix-zero-priority-after-race | Resolved | Tom Clegg | 05/14/2018

Related issues

Related to Arvados - Bug #13500: crunch-dispatch-slurm PG::TRDeadlockDetected: ERROR: deadlock detected | Closed | Tom Clegg
Related to Arvados - Idea #13574: [Controller] Update container priorities asynchronously | New
Related to Arvados - Bug #13594: PG::TRDeadlockDetected when running cwl tests in parallel | Resolved | Tom Clegg