Project

General

Profile

Actions

Bug #18026

closed

[singularity] crunch-run image cache race condition

Added by Ward Vandewege over 3 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Story points:
-

Description

This is from https://dev.arvados.org/issues/17755#note-26: =====================================================================================================

The first two failures I think are due to a race condition between two crunch-run processes trying to convert and caching the singularity image.

This attempt found the collection, but apparently before the cached image had been created. (not supposed to happen.)

2021-08-06T21:26:16.608212665Z Using Docker image id "sha256:337550d506a3fc77e30292bba95108f1cd34a33719f0dd997d0de0540522def7"
2021-08-06T21:26:16.608240987Z Loading Docker image from keep
2021-08-06T21:26:17.188261275Z building singularity image
2021-08-06T21:26:17.223285265Z [singularity build /tmp/crunch-run.tordo-dz642-pez87oegh5fgbs7.099258167/keep207354666/by_uuid/tordo-4zz18-t0wx4utpwx4ligf/image.sif docker-archive:///tmp/crunch-run-singularity-917581112/image.tar]
2021-08-06T21:26:26.466105517Z INFO: Starting build...
2021-08-06T21:26:26.466105517Z FATAL: While performing build: conveyor failed to get: Error loading tar component 337550d506a3fc77e30292bba95108f1cd34a33719f0dd997d0de0540522def7.json: open /tmp/crunch-run-singularity-917581112/image.tar: no such file or directory
2021-08-06T21:26:26.466234171Z error in Run: While loading container image: exit status 255
2021-08-06T21:26:26.466268708Z error in CaptureOutput: error scanning files to copy to output: lstat "/var/spool/cwl": lstat /tmp/crunch-run.tordo-dz642-pez87oegh5fgbs7.099258167/tmp701045228: no such file or directory
2021-08-06T21:26:26.605452965Z Cancelled

On the second attempt, it tried to create a collection with the same temporary name (down to the exact timestamp?) and that failed.

2021-08-06T21:26:47.149336231Z Executing container 'tordo-dz642-amjt50vnz4qyn4n'
...
2021-08-06T21:26:47.972965997Z error in Run: While loading container image: error creating 'singularity image for sha256:337550d506a3fc77e30292bba95108f1cd34a33719f0dd997d0de0540522def7 2021-08-06T21:26:47Z' collection: request failed: https://tordo.arvadosapi.com/arvados/v1/collections: 422 Unprocessable Entity: //railsapi.internal/arvados/v1/collections: 422 Unprocessable Entity: #<ActiveRecord::RecordNotUnique: PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "index_collections_on_owner_uuid_and_name"
2021-08-06T21:26:47.972965997Z DETAIL: Key (owner_uuid, name)=(tordo-j7d0g-7p82g804nk5l7gx, singularity image for sha256:337550d506a3fc77e30292bba95108f1cd34a33719f0dd997d0de0540522def7 2021-08-06T21:26:47Z) already exists.
2021-08-06T21:26:47.972965997Z : INSERT INTO "collections" ("owner_uuid", "created_at", "modified_by_user_uuid", "modified_at", "portable_data_hash", "updated_at", "uuid", "manifest_text", "name", "properties", "delete_at", "file_names", "trash_at", "current_version_uuid") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14) RETURNING "id"> (req-8t57dqc95orqsvelydce)
2021-08-06T21:26:48.136224600Z Cancelled


Subtasks 1 (0 open1 closed)

Task #18034: Review 17755-singularity-raceResolvedPeter Amstutz08/24/2021Actions

Related issues 2 (0 open2 closed)

Related to Arvados - Idea #17755: Test singularity support on a cloud cluster by running some real workflowsResolvedWard Vandewege09/03/2021Actions
Related to Arvados - Bug #18027: [crunch-run] upload failed (singularity)ResolvedTom Clegg09/03/2021Actions
Actions #1

Updated by Ward Vandewege over 3 years ago

  • Description updated (diff)
Actions #2

Updated by Ward Vandewege over 3 years ago

  • Description updated (diff)
Actions #3

Updated by Ward Vandewege over 3 years ago

  • Related to Idea #17755: Test singularity support on a cloud cluster by running some real workflows added
Actions #4

Updated by Peter Amstutz over 3 years ago

  • Category set to Crunch
  • Subject changed from [singularity] crunch-run race condition to [singularity] crunch-run image cache race condition
Actions #5

Updated by Peter Amstutz over 3 years ago

  • Assigned To set to Peter Amstutz
Actions #6

Updated by Peter Amstutz over 3 years ago

17755-singularity-race @ 22d2299b6d6280726c9692b72e2109a4fa768969

  • Adds ensure_unique_name to prevent two processes from creating the same "in progress" cache collection at the same time
  • Adds a Stat() check on the docker image tarfile, which is being read from singularity. This is a harmless (but possibly also useless) nudge on arv-mount to let it know we are about to read the file.

On further research, I'm still unsure about what happened in the first case (where it gets "no such file or directory" on both reading the docker image and trying to read the output directory). The arv-mount.txt log is empty. This makes me suspicious that arv-mount is either not ready or has crashed. The Stat() check will give a slightly more direct signal about whether the kernel thinks the Docker image tarfile exists.

Actions #7

Updated by Peter Amstutz over 3 years ago

  • Status changed from New to In Progress
Actions #8

Updated by Tom Clegg over 3 years ago

Peter Amstutz wrote:

17755-singularity-race @ 22d2299b6d6280726c9692b72e2109a4fa768969

  • Adds ensure_unique_name to prevent two processes from creating the same "in progress" cache collection at the same time

LGTM

  • Adds a Stat() check on the docker image tarfile, which is being read from singularity. This is a harmless (but possibly also useless) nudge on arv-mount to let it know we are about to read the file.

On further research, I'm still unsure about what happened in the first case (where it gets "no such file or directory" on both reading the docker image and trying to read the output directory). The arv-mount.txt log is empty. This makes me suspicious that arv-mount is either not ready or has crashed. The Stat() check will give a slightly more direct signal about whether the kernel thinks the Docker image tarfile exists.

Perhaps "start arv-mount and use it without waiting for it to be ready" worked reliably in the past only because our Docker image checking/loading inserted a substantial delay between "start" and "use", but now that image loading uses the fuse mount, we're sometimes hitting it before it's ready.

If that's true, I suppose we need a "wait for arv-mount to be ready" polling loop or something. Meanwhile/regardless, Stat seems like it will be either harmless or helpful to reveal what's happening.

LGTM

Actions #9

Updated by Tom Clegg over 3 years ago

Uh, on second thought
  • we already have a "wait for arv-mount to be ready" polling loop, as Peter pointed out in chat
  • shouldn't the new Stat() call be
-               if _, err := os.Stat(dockerImageID); err != nil {
+               if _, err := os.Stat(imageTarballPath); err != nil {

?

Actions #10

Updated by Peter Amstutz over 3 years ago

Tom Clegg wrote:

Uh, on second thought
  • we already have a "wait for arv-mount to be ready" polling loop, as Peter pointed out in chat
  • shouldn't the new Stat() call be

[...]

?

Fixed.

17755-singularity-race @ adc1cd0362bb9fd201b8c470eaf816ca5c212c72

developer-run-tests: #2653

Actions #11

Updated by Peter Amstutz over 3 years ago

The other problem with accessing the keep mount seems to be separate, it is being investigated in #18027

Actions #12

Updated by Peter Amstutz over 3 years ago

  • Related to Bug #18027: [crunch-run] upload failed (singularity) added
Actions #13

Updated by Peter Amstutz over 3 years ago

  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF