Bug #8805

closed

[Crunch] os.walk doing recursive copy early in a Crunch script causes a silent exit 1

Added by Sarah Guthrie about 8 years ago. Updated over 4 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: -
Target version: -
Story points: -

Description

Two jobs with identical inputs, run in quick succession on the same qr1hi compute node, failed at different points while running the same crunch script.

They both go through a directory, using os.walk, copying everything to a temporary directory.

I used excessive logging to figure out that both jobs were failing at different points, despite having the same inputs, crunch_script version, docker image, and compute node.

This behavior was not observed when the copy was done with subprocess.check_call(['cp', '-r']) instead of os.walk.
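For reference, a minimal sketch of a cp-based copy like the one mentioned above. The function name copy_tree_with_cp and the trailing "/." (which copies the contents of the source directory into an existing destination) are assumptions; the exact cp arguments used in the original script are not given in this report.

```python
import os
import subprocess


def copy_tree_with_cp(src_root, dst_root):
    """Copy the contents of src_root into dst_root by shelling out to cp.

    Hypothetical helper: the arguments the original script passed to cp
    are not shown in the report.
    """
    # "src_root/." asks cp to copy the directory's contents rather than
    # creating a nested src_root directory inside dst_root.
    source = os.path.join(src_root, ".")
    # check_call raises CalledProcessError if cp exits non-zero, so a
    # failed copy is reported instead of passing silently.
    subprocess.check_call(["cp", "-r", source, dst_root])
```

Because check_call raises on a non-zero exit status, a failure inside cp would surface as a Python exception with the exit code attached.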


Files

8805walk.py (1.27 KB) Brett Smith, 03/28/2016 06:28 PM

Related issues

Related to Arvados - Feature #8801: [Crunch] log free disk space before task starts (Duplicate)
Actions #1

Updated by Sarah Guthrie about 8 years ago

  • Description updated (diff)
  • Priority changed from High to Normal
Actions #2

Updated by Brett Smith about 8 years ago

  • File 8805walk.py added
  • Subject changed from Exact same job fails inconsistently on qr1hi to [Crunch] os.walk doing recursive copy early in a Crunch script causes a silent exit 1

This is a very subtle bug that seems to involve some bad timing between os.walk and the underlying storage, possibly interacting with Docker's volume layer.

The job is just doing an os.walk to recursively copy the CRUNCH_SRC tree to TASK_WORK, using os.mkdir and shutil.copyfile as appropriate. When it fails, the job silently exits 1, at a seemingly random time during the walk. There's no exception or other error message.
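As an illustration of the kind of copy described above, here is a minimal sketch of a recursive copy built on os.walk, os.mkdir, and shutil.copyfile. It is not the original crunch script; the function name copy_tree and its arguments are hypothetical.

```python
import os
import shutil


def copy_tree(src_root, dst_root):
    """Recursively copy src_root into dst_root using os.walk.

    Hypothetical reconstruction; the original script is not shown here.
    """
    # os.walk is top-down by default, so each parent directory is
    # visited (and created on the destination side) before its children.
    for dirpath, dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        # normpath collapses the "." produced for the top-level directory.
        dst_dir = os.path.normpath(os.path.join(dst_root, rel))
        if not os.path.isdir(dst_dir):
            os.mkdir(dst_dir)
        for name in filenames:
            shutil.copyfile(os.path.join(dirpath, name),
                            os.path.join(dst_dir, name))
```

In a crunch script this would presumably be called with the CRUNCH_SRC and TASK_WORK locations, for example copy_tree(os.environ["CRUNCH_SRC"], os.environ["TASK_WORK"]), assuming both are exposed as environment variables.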

arv-mount shouldn't be involved, because neither the source nor destination files live under it.

I can reliably reproduce the problem when running the same script version the same way. Adding more logging print statements and exception handling (wrapping the whole walk in a try block) doesn't change anything.
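A sketch of what such instrumentation might look like, assuming a walk-based copy. All names here are hypothetical. It flushes each log line immediately, catches BaseException so that a SystemExit would also be logged, and passes onerror to os.walk, which otherwise ignores errors raised while listing directories. As noted above, this kind of instrumentation did not reveal anything in the failing job.

```python
import os
import shutil
import sys
import traceback


def log(message):
    # Write to stderr and flush right away so the last line before an
    # abrupt exit is not left sitting in a buffer.
    sys.stderr.write(message + "\n")
    sys.stderr.flush()


def raise_walk_error(err):
    # Without an onerror callback, os.walk skips directories it cannot list.
    raise err


def logged_copy_tree(src_root, dst_root):
    """Copy src_root into dst_root, logging every step. Returns file count."""
    copied = 0
    try:
        for dirpath, dirnames, filenames in os.walk(
                src_root, onerror=raise_walk_error):
            rel = os.path.relpath(dirpath, src_root)
            dst_dir = os.path.normpath(os.path.join(dst_root, rel))
            log("entering %s" % dirpath)
            if not os.path.isdir(dst_dir):
                os.mkdir(dst_dir)
            for name in filenames:
                src_file = os.path.join(dirpath, name)
                shutil.copyfile(src_file, os.path.join(dst_dir, name))
                copied += 1
                log("copied %s" % src_file)
    except BaseException:
        # Log the traceback for anything, including SystemExit, then re-raise.
        log(traceback.format_exc())
        raise
    return copied
```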

If I run ls -lR on the CRUNCH_SRC and TASK_WORK trees before starting the walk, it no longer exits. See the brett-walk-debug branch in the original job repository.
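The pre-listing workaround could be sketched roughly as follows. The helper name prelist is hypothetical, and discarding the ls output is an assumption of this sketch; the actual change is in the brett-walk-debug branch and is not reproduced here. Why listing the trees first avoids the exit is not known.

```python
import os
import subprocess


def prelist(*roots):
    """Run `ls -lR` over each root before the walk starts.

    Hypothetical helper; the listing output is discarded here, which may
    differ from what the debug branch does.
    """
    with open(os.devnull, "w") as devnull:
        # ls -lR stats every entry under each root; check_call raises
        # CalledProcessError if ls exits non-zero.
        subprocess.check_call(["ls", "-lR"] + list(roots), stdout=devnull)
```

It would be called once with both trees, for example prelist(src_root, dst_root), before starting the os.walk copy.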

If I try to reproduce the problem with a standalone script in my own repository (brett on qr1hi), I can't do it. This is true whether I use a copy implementation exactly like Sally's, or one that eschews shutil for Python builtins (I tried this in the hopes of getting lower-level error reporting). For posterity, I've attached that script. I'm guessing I can't reproduce because my CRUNCH_SRC is so much smaller and simpler than the one Sally's script is running with.
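To illustrate what replacing shutil with Python builtins can mean, here is a minimal file copy using only open, read, and write. This is a sketch, not the attached 8805walk.py; the function name and buffer size are assumptions.

```python
def copy_file_builtin(src_path, dst_path, bufsize=64 * 1024):
    """Copy one file using only builtin file objects.

    Hypothetical helper: any I/O error surfaces directly from open(),
    read(), or write() rather than from inside shutil.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(bufsize)
            if not chunk:
                # An empty read means end of file.
                break
            dst.write(chunk)
```

Inside a walk loop it would stand in for the shutil.copyfile call, taking the same source and destination paths.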

Actions #3

Updated by Brett Smith about 8 years ago

I managed to watch the system logs on a compute node as a job ran and failed, but they didn't reveal anything interesting either.

Actions #4

Updated by Peter Amstutz over 4 years ago

  • Status changed from New to Closed