Bug #18026

[singularity] crunch-run image cache race condition

Added by Ward Vandewege about 2 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Start date: 08/24/2021
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -

Description

This is from https://dev.arvados.org/issues/17755#note-26:

The first two failures I think are due to a race condition between two crunch-run processes trying to convert and cache the singularity image.

This attempt found the collection, but apparently before the cached image had been created (which is not supposed to happen):

2021-08-06T21:26:16.608212665Z Using Docker image id "sha256:337550d506a3fc77e30292bba95108f1cd34a33719f0dd997d0de0540522def7"
2021-08-06T21:26:16.608240987Z Loading Docker image from keep
2021-08-06T21:26:17.188261275Z building singularity image
2021-08-06T21:26:17.223285265Z [singularity build /tmp/crunch-run.tordo-dz642-pez87oegh5fgbs7.099258167/keep207354666/by_uuid/tordo-4zz18-t0wx4utpwx4ligf/image.sif docker-archive:///tmp/crunch-run-singularity-917581112/image.tar]
2021-08-06T21:26:26.466105517Z INFO: Starting build...
2021-08-06T21:26:26.466105517Z FATAL: While performing build: conveyor failed to get: Error loading tar component 337550d506a3fc77e30292bba95108f1cd34a33719f0dd997d0de0540522def7.json: open /tmp/crunch-run-singularity-917581112/image.tar: no such file or directory
2021-08-06T21:26:26.466234171Z error in Run: While loading container image: exit status 255
2021-08-06T21:26:26.466268708Z error in CaptureOutput: error scanning files to copy to output: lstat "/var/spool/cwl": lstat /tmp/crunch-run.tordo-dz642-pez87oegh5fgbs7.099258167/tmp701045228: no such file or directory
2021-08-06T21:26:26.605452965Z Cancelled

On the second attempt, it tried to create a collection with the same temporary name (down to the exact timestamp?), and that failed:

2021-08-06T21:26:47.149336231Z Executing container 'tordo-dz642-amjt50vnz4qyn4n'
...
2021-08-06T21:26:47.972965997Z error in Run: While loading container image: error creating 'singularity image for sha256:337550d506a3fc77e30292bba95108f1cd34a33719f0dd997d0de0540522def7 2021-08-06T21:26:47Z' collection: request failed: https://tordo.arvadosapi.com/arvados/v1/collections: 422 Unprocessable Entity: //railsapi.internal/arvados/v1/collections: 422 Unprocessable Entity: #<ActiveRecord::RecordNotUnique: PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "index_collections_on_owner_uuid_and_name"
2021-08-06T21:26:47.972965997Z DETAIL: Key (owner_uuid, name)=(tordo-j7d0g-7p82g804nk5l7gx, singularity image for sha256:337550d506a3fc77e30292bba95108f1cd34a33719f0dd997d0de0540522def7 2021-08-06T21:26:47Z) already exists.
2021-08-06T21:26:47.972965997Z : INSERT INTO "collections" ("owner_uuid", "created_at", "modified_by_user_uuid", "modified_at", "portable_data_hash", "updated_at", "uuid", "manifest_text", "name", "properties", "delete_at", "file_names", "trash_at", "current_version_uuid") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14) RETURNING "id"> (req-8t57dqc95orqsvelydce)
2021-08-06T21:26:48.136224600Z Cancelled


Subtasks

Task #18034: Review 17755-singularity-race (Resolved, Peter Amstutz)


Related issues

Related to Arvados - Story #17755: Test singularity support on a cloud cluster by running some real workflows (Resolved, 09/03/2021)

Related to Arvados - Bug #18027: [crunch-run] upload failed (singularity) (Resolved, 09/03/2021)

History

#1 Updated by Ward Vandewege about 2 months ago

  • Description updated (diff)

#2 Updated by Ward Vandewege about 2 months ago

  • Description updated (diff)

#3 Updated by Ward Vandewege about 2 months ago

  • Related to Story #17755: Test singularity support on a cloud cluster by running some real workflows added

#4 Updated by Peter Amstutz about 2 months ago

  • Category set to Crunch
  • Subject changed from [singularity] crunch-run race condition to [singularity] crunch-run image cache race condition

#5 Updated by Peter Amstutz about 2 months ago

  • Assigned To set to Peter Amstutz

#6 Updated by Peter Amstutz about 2 months ago

17755-singularity-race @ 22d2299b6d6280726c9692b72e2109a4fa768969

  • Adds ensure_unique_name to prevent two processes from creating the same "in progress" cache collection at the same time
  • Adds a Stat() check on the Docker image tarfile, which is read by singularity. This is a harmless (but possibly also useless) nudge to arv-mount to let it know we are about to read the file.
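The ensure_unique_name behavior can be sketched as follows. This is a self-contained simulation of the server-side semantics (in real Arvados the conflict is detected by the database unique constraint and resolved in Rails; on conflict the server appends a numeric suffix to the name instead of returning 422). All function and variable names here are illustrative, not the actual Arvados implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// key mimics the index_collections_on_owner_uuid_and_name unique constraint.
type key struct{ ownerUUID, name string }

var collections = map[key]bool{}

var errRecordNotUnique = errors.New("duplicate key value violates unique constraint")

// createCollection sketches the server-side behavior: with
// ensureUniqueName set, a name conflict is resolved by appending a
// numeric suffix instead of failing with 422 Unprocessable Entity.
func createCollection(ownerUUID, name string, ensureUniqueName bool) (string, error) {
	k := key{ownerUUID, name}
	if !collections[k] {
		collections[k] = true
		return name, nil
	}
	if !ensureUniqueName {
		return "", errRecordNotUnique
	}
	for n := 2; ; n++ {
		candidate := fmt.Sprintf("%s (%d)", name, n)
		k = key{ownerUUID, candidate}
		if !collections[k] {
			collections[k] = true
			return candidate, nil
		}
	}
}

func main() {
	// Two crunch-run processes racing to create the same cache collection:
	// with ensure_unique_name, the second create succeeds with a new name.
	name := "singularity image for sha256:3375... 2021-08-06T21:26:47Z"
	a, _ := createCollection("tordo-j7d0g-7p82g804nk5l7gx", name, true)
	b, _ := createCollection("tordo-j7d0g-7p82g804nk5l7gx", name, true)
	fmt.Println(a)
	fmt.Println(b)
}
```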

On further research, I'm still unsure about what happened in the first case (where it gets "no such file or directory" on both reading the docker image and trying to read the output directory). The arv-mount.txt log is empty. This makes me suspicious that arv-mount is either not ready or has crashed. The Stat() check will give a slightly more direct signal about whether the kernel thinks the Docker image tarfile exists.

#7 Updated by Peter Amstutz about 2 months ago

  • Status changed from New to In Progress

#8 Updated by Tom Clegg about 2 months ago

Peter Amstutz wrote:

17755-singularity-race @ 22d2299b6d6280726c9692b72e2109a4fa768969

  • Adds ensure_unique_name to prevent two processes from creating the same "in progress" cache collection at the same time

LGTM

  • Adds a Stat() check on the Docker image tarfile, which is read by singularity. This is a harmless (but possibly also useless) nudge to arv-mount to let it know we are about to read the file.

On further research, I'm still unsure about what happened in the first case (where it gets "no such file or directory" on both reading the docker image and trying to read the output directory). The arv-mount.txt log is empty. This makes me suspicious that arv-mount is either not ready or has crashed. The Stat() check will give a slightly more direct signal about whether the kernel thinks the Docker image tarfile exists.

Perhaps "start arv-mount and use it without waiting for it to be ready" worked reliably in the past only because our Docker image checking/loading inserted a substantial delay between "start" and "use", but now that image loading uses the fuse mount, we're sometimes hitting it before it's ready.

If that's true, I suppose we need a "wait for arv-mount to be ready" polling loop or something. Meanwhile/regardless, Stat seems like it will be either harmless or helpful to reveal what's happening.

LGTM

#9 Updated by Tom Clegg about 2 months ago

Uh, on second thought
  • we already have a "wait for arv-mount to be ready" polling loop, as Peter pointed out in chat
  • shouldn't the new Stat() call be
-               if _, err := os.Stat(dockerImageID); err != nil {
+               if _, err := os.Stat(imageTarballPath); err != nil {

?

#10 Updated by Peter Amstutz about 2 months ago

Tom Clegg wrote:

Uh, on second thought
  • we already have a "wait for arv-mount to be ready" polling loop, as Peter pointed out in chat
  • shouldn't the new Stat() call be

[...]

?

Fixed.

17755-singularity-race @ adc1cd0362bb9fd201b8c470eaf816ca5c212c72

https://ci.arvados.org/view/Developer/job/developer-run-tests/2653/

#11 Updated by Peter Amstutz about 2 months ago

The other problem with accessing the keep mount seems to be separate; it is being investigated in #18027.

#12 Updated by Peter Amstutz about 2 months ago

  • Related to Bug #18027: [crunch-run] upload failed (singularity) added

#13 Updated by Peter Amstutz about 2 months ago

  • Status changed from In Progress to Resolved
