Project

General

Profile

Actions

Bug #18026

closed

[singularity] crunch-run image cache race condition

Added by Ward Vandewege over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Story points:
-

Description

This is from https://dev.arvados.org/issues/17755#note-26: =====================================================================================================

The first two failures I think are due to a race condition between two crunch-run processes trying to convert and caching the singularity image.

This attempt found the collection, but apparently before the cached image had been created. (not supposed to happen.)

2021-08-06T21:26:16.608212665Z Using Docker image id "sha256:337550d506a3fc77e30292bba95108f1cd34a33719f0dd997d0de0540522def7"
2021-08-06T21:26:16.608240987Z Loading Docker image from keep
2021-08-06T21:26:17.188261275Z building singularity image
2021-08-06T21:26:17.223285265Z [singularity build /tmp/crunch-run.tordo-dz642-pez87oegh5fgbs7.099258167/keep207354666/by_uuid/tordo-4zz18-t0wx4utpwx4ligf/image.sif docker-archive:///tmp/crunch-run-singularity-917581112/image.tar]
2021-08-06T21:26:26.466105517Z INFO: Starting build...
2021-08-06T21:26:26.466105517Z FATAL: While performing build: conveyor failed to get: Error loading tar component 337550d506a3fc77e30292bba95108f1cd34a33719f0dd997d0de0540522def7.json: open /tmp/crunch-run-singularity-917581112/image.tar: no such file or directory
2021-08-06T21:26:26.466234171Z error in Run: While loading container image: exit status 255
2021-08-06T21:26:26.466268708Z error in CaptureOutput: error scanning files to copy to output: lstat "/var/spool/cwl": lstat /tmp/crunch-run.tordo-dz642-pez87oegh5fgbs7.099258167/tmp701045228: no such file or directory
2021-08-06T21:26:26.605452965Z Cancelled

On the second attempt, it tried to create a collection with the same temporary name (down to the exact timestamp?) and that failed.

2021-08-06T21:26:47.149336231Z Executing container 'tordo-dz642-amjt50vnz4qyn4n'
...
2021-08-06T21:26:47.972965997Z error in Run: While loading container image: error creating 'singularity image for sha256:337550d506a3fc77e30292bba95108f1cd34a33719f0dd997d0de0540522def7 2021-08-06T21:26:47Z' collection: request failed: https://tordo.arvadosapi.com/arvados/v1/collections: 422 Unprocessable Entity: //railsapi.internal/arvados/v1/collections: 422 Unprocessable Entity: #<ActiveRecord::RecordNotUnique: PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "index_collections_on_owner_uuid_and_name"
2021-08-06T21:26:47.972965997Z DETAIL: Key (owner_uuid, name)=(tordo-j7d0g-7p82g804nk5l7gx, singularity image for sha256:337550d506a3fc77e30292bba95108f1cd34a33719f0dd997d0de0540522def7 2021-08-06T21:26:47Z) already exists.
2021-08-06T21:26:47.972965997Z : INSERT INTO "collections" ("owner_uuid", "created_at", "modified_by_user_uuid", "modified_at", "portable_data_hash", "updated_at", "uuid", "manifest_text", "name", "properties", "delete_at", "file_names", "trash_at", "current_version_uuid") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14) RETURNING "id"> (req-8t57dqc95orqsvelydce)
2021-08-06T21:26:48.136224600Z Cancelled


Subtasks 1 (0 open1 closed)

Task #18034: Review 17755-singularity-raceResolvedPeter Amstutz08/24/2021Actions

Related issues

Related to Arvados - Idea #17755: Test singularity support on a cloud cluster by running some real workflowsResolvedWard Vandewege09/03/2021Actions
Related to Arvados - Bug #18027: [crunch-run] upload failed (singularity)ResolvedTom Clegg09/03/2021Actions
Actions

Also available in: Atom PDF