Bug #8805
closed
[Crunch] os.walk doing recursive copy early in a Crunch script causes a silent exit 1
Description
- https://workbench.qr1hi.arvadosapi.com/pipeline_instances/qr1hi-d1hrv-hu682pkl5bjfpny#
- https://workbench.qr1hi.arvadosapi.com/pipeline_instances/qr1hi-d1hrv-0agoz5tizyfudr1#
Both of these pipeline instances go through a directory using os.walk, copying everything to a temporary directory.
I used excessive logging to figure out that both jobs were failing at different points, despite having the same inputs, crunch_script version, docker image, and compute node.
This behavior was not observed when using subprocess.check_call(['cp', '-r']).
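For reference, the cp-based copy that did not show the problem could look roughly like the sketch below. The function name and the src/dst arguments are made up for illustration; the exact arguments passed to cp in the real script are not recorded here.

```python
import subprocess


def copy_tree_with_cp(src, dst):
    """Recursively copy src to dst by shelling out to `cp -r`.

    src and dst are hypothetical parameters; the real script's paths
    are not shown in this ticket. check_call raises
    subprocess.CalledProcessError if cp exits non-zero, so a failed
    copy is at least reported rather than passing silently.
    """
    subprocess.check_call(['cp', '-r', src, dst])
```

If dst does not exist yet, cp creates it as a copy of src; if it already exists as a directory, cp places src inside it.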
Updated by Sarah Guthrie almost 9 years ago
- Description updated
- Priority changed from High to Normal
Updated by Brett Smith almost 9 years ago
- File 8805walk.py added
- Subject changed from Exact same job fails inconsistently on qr1hi to [Crunch] os.walk doing recursive copy early in a Crunch script causes a silent exit 1
This is a very subtle bug that seems to involve some bad timing between os.walk and the underlying storage, possibly interacting with Docker's volume layer.
The job is just doing an os.walk to recursively copy the CRUNCH_SRC tree to TASK_WORK, using os.mkdir and shutil.copyfile as appropriate. When it fails, the job silently exits 1, at a seemingly random time during the walk. There's no exception or other error message.
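To make the shape of the failing code concrete, here is a minimal sketch of that kind of copy. It is not the original crunch script: walk_copy, src_root and dst_root are names I made up, and it assumes the destination directory already exists.

```python
import os
import shutil


def walk_copy(src_root, dst_root):
    """Copy the tree under src_root into dst_root.

    Hypothetical helper, not the original job code. dst_root must
    already exist. os.walk is top-down by default, so each directory
    is created before the walk descends into it.
    """
    for dirpath, dirnames, filenames in os.walk(src_root):
        # Map the current source directory onto the destination tree.
        rel = os.path.relpath(dirpath, src_root)
        dst_dir = os.path.normpath(os.path.join(dst_root, rel))
        for name in dirnames:
            os.mkdir(os.path.join(dst_dir, name))
        for name in filenames:
            shutil.copyfile(os.path.join(dirpath, name),
                            os.path.join(dst_dir, name))
```

In the job, the two roots would be the CRUNCH_SRC and TASK_WORK trees.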
arv-mount shouldn't be involved, because neither the source nor destination files live under it.
I can reliably reproduce the problem when running the same script version the same way. Adding more logging print statements and exception handling (wrapping the whole walk in a try block) doesn't change anything.
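One detail that matters for this kind of debugging: by default os.walk ignores errors raised while listing a directory, so a try block around the walk never sees them unless an onerror callback is passed. The sketch below (Python 3; all names are hypothetical, and it is not the logging actually used in the job) combines both. Nothing here suggests a swallowed listing error is the cause of the exit 1; it only closes that gap in the error reporting.

```python
import os
import sys
import traceback


def log_walk_error(err):
    """onerror callback: report directory-listing errors os.walk would drop."""
    print('os.walk error: {!r} on {}'.format(err, err.filename),
          file=sys.stderr)


def walk_with_logging(root, visit):
    """Call visit(dirpath, dirnames, filenames) for each directory under root.

    Hypothetical helper. BaseException is caught so that SystemExit and
    KeyboardInterrupt are logged too, then re-raised unchanged.
    """
    try:
        for dirpath, dirnames, filenames in os.walk(root,
                                                    onerror=log_walk_error):
            visit(dirpath, dirnames, filenames)
    except BaseException:
        traceback.print_exc()
        raise
```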
If I run ls -lR on the CRUNCH_SRC and TASK_WORK trees before starting the walk, it no longer exits. See the brett-walk-debug branch in the original job repository.
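The listing step can be sketched as below. This is an illustration, not the code on the brett-walk-debug branch: prelist is a made-up name, and the caller is assumed to pass in the two tree paths.

```python
import subprocess


def prelist(*roots):
    """Run `ls -lR` on each given tree before the walk starts.

    Hypothetical helper. The listing goes to the job's stdout;
    check_call raises subprocess.CalledProcessError if ls exits
    non-zero (for example, when a path does not exist).
    """
    for root in roots:
        subprocess.check_call(['ls', '-lR', root])
```

Calling it with both tree paths right before the walk mirrors the ordering described above. Why that ordering avoids the exit is still unexplained.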
If I try to reproduce the problem with a standalone script in my own repository (brett on qr1hi), I can't do it. This is true whether I use a copy implementation exactly like Sally's, or one that eschews shutil for Python builtins (I tried this in the hopes of getting lower-level error reporting). For posterity, I've attached that script. I'm guessing I can't reproduce because my CRUNCH_SRC is so much smaller and simpler than the one Sally's script is running with.
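For readers without the attachment, a builtins-only file copy generally looks something like the following. This is a sketch of the idea, not the contents of 8805walk.py; the function name and chunk size are arbitrary choices of mine.

```python
CHUNK_SIZE = 1 << 20  # 1 MiB per read; an arbitrary choice for illustration


def copy_file_builtin(src_path, dst_path):
    """Copy one file using only open/read/write instead of shutil.copyfile.

    Hypothetical helper. Any I/O failure surfaces as an exception
    raised directly by the failing open(), read() or write() call.
    """
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        while True:
            chunk = src.read(CHUNK_SIZE)
            if not chunk:  # empty bytes means end of file
                break
            dst.write(chunk)
```

A function like this can stand in for shutil.copyfile inside the walk loop, leaving the rest of the copy unchanged.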
Updated by Brett Smith almost 9 years ago
I managed to watch the system logs on a compute node as a job ran and failed, but they didn't reveal anything interesting either.