Bug #14495 (closed)

[crunch2] include space required to download/unpack docker image in tmp disk request

Added by Ward Vandewege about 6 years ago. Updated almost 6 years ago.

Status: Resolved
Priority: Normal
Assigned To: Peter Amstutz
Category: -
Target version: 2018-12-21 Sprint
Story points: -
Release relationship: Auto

Description

Proposed fix:

The API server should add the Docker image size, multiplied by 3, to the disk space request. (The multiplication factor accounts for expansion of the compressed layers, and for staging the layers to scratch space while they are decompressed.)
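For example, under this rule a 2 GiB image collection adds 6 GiB to the container's disk request, covering the downloaded tarball plus headroom for staging and decompressing its layers.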

Original description:

Container request d79c1-xvhdp-28emqt3jby9s2a8 was seemingly stuck - its child container requests remained in the queued state after many hours.

What was actually happening was that the child container d79c1-dz642-3apw65ik2snziqh was scheduled on a compute node that didn't have sufficient scratch space available to load the (large) docker image:

2018-11-14T13:33:31.066367347Z Docker response: {"errorDetail":{"message":"Error processing tar file(exit status 1): write /195522a76483d96f4cc529d0b11e5e840596992eddfb89f4f86499a4af381832/layer.tar: no space left on device"},"error":"Error processing tar file(exit status 1): write /195522a76483d96f4cc529d0b11e5e840596992eddfb89f4f86499a4af381832/layer.tar: no space left on device"}
2018-11-14T13:33:31.066643166Z Running [arv-mount --foreground --allow-other --read-write --crunchstat-interval=10 --file-cache 268435456 --mount-by-pdh by_id /tmp/crunch-run.d79c1-dz642-3apw65ik2snziqh.694256924/keep198730187]

This kept happening, and because the container couldn't be started, it remained in the Queued state in Workbench, with the only hint being the lines above in the logs.

The workaround is easy: specify a large enough tmpdirMin in the workflow. But it is far too hard for the user to figure that out on their own.

We should probably error out immediately when this happens, and we need to make it clear to the user what the actual problem is.

Or maybe we can take the size of the docker image into account before allocating a job to a compute node? That would be even better.


Subtasks 1 (0 open, 1 closed)

Task #14544: Review 14495-crunch-docker-space (Resolved, Peter Amstutz, 12/10/2018)

Related issues 1 (0 open, 1 closed)

Related to Arvados - Bug #14540: [API] Limit number of container lock/unlock cycles (Duplicate)
#1

Updated by Ward Vandewege about 6 years ago

  • Target version set to To Be Groomed
#2

Updated by Ward Vandewege about 6 years ago

  • Subject changed from [crunch2] fail container if the compute node it is being run on doesn't have sufficient space to load the docker image to [crunch2] containers are retried indefinitely if the compute node it is being run on doesn't have sufficient space to load the docker image
  • Description updated (diff)
#3

Updated by Peter Amstutz about 6 years ago

Yes, this should be handled by the infrastructure.

I had a discussion on this exact topic with the Cromwell folks recently; the heuristic we came up with was to reserve 3x the sum of the image layer sizes (which is just the size of the image tarball).

We also allocate nodes based on total space, not available space, so if a node has been up for a while, it could end up caching multiple Docker images, reducing available space and throwing off the calculation.
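To make the difference between total and available space concrete, here is a minimal, Linux-only sketch (not part of any Arvados code; "/tmp" is just an example path) that reads both figures for a scratch directory. A node that has been caching Docker images for a while will report an available figure well below its total capacity:

    package main

    import (
    	"fmt"
    	"syscall"
    )

    func main() {
    	// Inspect the filesystem backing the scratch directory. Statfs
    	// reports both the filesystem's total size and the space still
    	// available, which is the number that cached images eat into.
    	var st syscall.Statfs_t
    	if err := syscall.Statfs("/tmp", &st); err != nil {
    		panic(err)
    	}
    	total := st.Blocks * uint64(st.Bsize)
    	avail := st.Bavail * uint64(st.Bsize)
    	fmt.Printf("scratch filesystem: %d bytes total, %d bytes available\n", total, avail)
    }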

#4

Updated by Peter Amstutz about 6 years ago

We should also limit the number of lock/unlock cycles to avoid this "infinite retry" problem.

#5

Updated by Tom Morris about 6 years ago

  • Target version changed from To Be Groomed to 2018-12-12 Sprint
#6

Updated by Peter Amstutz about 6 years ago

  • Subject changed from [crunch2] containers are retried indefinitely if the compute node it is being run on doesn't have sufficient space to load the docker image to [crunch2] include space required to download/unpack docker image in tmp disk request
#7

Updated by Peter Amstutz about 6 years ago

  • Related to Bug #14540: [API] Limit number of container lock/unlock cycles added
#8

Updated by Peter Amstutz about 6 years ago

  • Description updated (diff)
#9

Updated by Peter Amstutz about 6 years ago

  • Assigned To set to Peter Amstutz
#10

Updated by Peter Amstutz about 6 years ago

  • Description updated (diff)
#11

Updated by Peter Amstutz about 6 years ago

We calculate the disk space request as the sum of the requested capacities of "tmp" mounts, not as a single number stored in runtime_constraints (unlike the RAM request). So the dispatcher should be the one that incorporates image size into the disk space request, not the API server. On the plus side, not modifying the container record means it doesn't invalidate reuse.
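As a rough illustration of the calculation described above, the sketch below sums the requested capacities of a container's "tmp" mounts and adds a 3x allowance for the image. The Mount and Container structs and the estimateImageFootprint helper are simplified stand-ins for illustration only, not the actual Arvados SDK types or the code on the branch:

    package main

    import "fmt"

    // Simplified stand-ins for the container record fields discussed in
    // this comment; the real Arvados types carry many more fields.
    type Mount struct {
    	Kind     string // e.g. "tmp", "collection"
    	Capacity int64  // requested size in bytes, for "tmp" mounts
    }

    type Container struct {
    	Mounts map[string]Mount
    }

    // estimateImageFootprint applies the 3x heuristic from this ticket:
    // however the dispatcher learns the size of the saved image tarball,
    // it reserves three times that amount for downloading, staging, and
    // decompressing the layers.
    func estimateImageFootprint(imageTarballBytes int64) int64 {
    	return imageTarballBytes * 3
    }

    // scratchRequest is the dispatcher-side disk request: the existing sum
    // of "tmp" mount capacities, plus the image footprint estimate. The
    // container record itself is left untouched, so container reuse is
    // not invalidated.
    func scratchRequest(ctr Container, imageTarballBytes int64) int64 {
    	var total int64
    	for _, m := range ctr.Mounts {
    		if m.Kind == "tmp" {
    			total += m.Capacity
    		}
    	}
    	return total + estimateImageFootprint(imageTarballBytes)
    }

    func main() {
    	ctr := Container{Mounts: map[string]Mount{
    		"/tmp":  {Kind: "tmp", Capacity: 10 << 30}, // 10 GiB of scratch
    		"/keep": {Kind: "collection"},
    	}}
    	// A 2 GiB image tarball: 10 GiB + 3*2 GiB = 16 GiB requested.
    	fmt.Println(scratchRequest(ctr, 2<<30))
    }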

#12

Updated by Tom Clegg about 6 years ago

Peter Amstutz wrote:

the dispatcher should be the one that incorporates image size into the disk space request, not the API server.

Agreed

#13

Updated by Peter Amstutz about 6 years ago

14495-crunch-docker-space @ 934d880aa5d10ed3382f9924a9a9f5694b41f266

  • Estimate size of docker image
  • Incorporate estimate into disk space request

https://ci.curoverse.com/view/Developer/job/developer-run-tests/1006/

#14

Updated by Lucas Di Pentima about 6 years ago

  • Nice code comments and clever way to do the estimation!
  • In the two cases where the estimate is 0, can we log a warning message for potential debugging needs?
  • Apart from that, LGTM.
#15

Updated by Peter Amstutz about 6 years ago

Need to go and double check that this fix would have fixed the original problem report.

#16

Updated by Peter Amstutz about 6 years ago

  • Target version changed from 2018-12-12 Sprint to 2018-12-21 Sprint
#17

Updated by Peter Amstutz about 6 years ago

  • Status changed from New to Resolved
#18

Updated by Tom Morris almost 6 years ago

  • Release set to 15