Bug #22226
closedcrunch-run segfault+panic when output directory includes a file mounted from a collection
Description
I have the full log if needed but here's the important bits. Worth noting, this is on a test compute node with Docker 27. The same compute image works with Singularity okay.
Oct 21 14:51:18 crunch-run[1495]: crunch-run 2.8.0~dev20241015141800 (go1.21.10) started … Oct 21 14:51:55 crunch-run[1495]: tordo-dz642-84o7gwkioo1nvn9 2024-10-21T14:51:55.242438032Z copying "*" Oct 21 14:51:55 crunch-run[1495]: tordo-dz642-84o7gwkioo1nvn9 2024-10-21T14:51:55.245246755Z copying ".aws/config" from c80031b757ffc50684d0b8cbe5282587+112/.aws/config Oct 21 14:51:55 crunch-run[1495]: tordo-dz642-84o7gwkioo1nvn9 2024-10-21T14:51:55.269394051Z copying "download.sh" from c80031b757ffc50684d0b8cbe5282587+112/download.sh Oct 21 14:51:55 crunch-run[1495]: tordo-dz642-84o7gwkioo1nvn9 2024-10-21T14:51:55.271639670Z Running [arv-mount --unmount-timeout=8 --unmount /tmp/crunch-run.tordo-dz642-84o7gwkioo1nvn9.3147090036/keep1196122805] Oct 21 14:51:55 crunch-run[1495]: tordo-dz642-84o7gwkioo1nvn9 2024-10-21T14:51:55.981973623Z crunch-run finished Oct 21 14:51:55 crunch-run[1495]: panic: runtime error: invalid memory address or nil pointer dereference Oct 21 14:51:55 crunch-run[1495]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x9581e9] Oct 21 14:51:55 crunch-run[1495]: goroutine 1 [running]: Oct 21 14:51:55 crunch-run[1495]: git.arvados.org/arvados.git/sdk/go/arvados.(*dirnode).Child(0xc0000a4900, {0xc0007b8570, 0xb}, 0x0?) Oct 21 14:51:55 crunch-run[1495]: /arvados/sdk/go/arvados/fs_collection.go:993 +0x29 Oct 21 14:51:55 crunch-run[1495]: git.arvados.org/arvados.git/sdk/go/arvados.(*filenode).Splice(0xc0000b2e60, {0x2969988?, 0xc000444e60?}) Oct 21 14:51:55 crunch-run[1495]: /arvados/sdk/go/arvados/fs_collection.go:964 +0x168 Oct 21 14:51:55 crunch-run[1495]: git.arvados.org/arvados.git/sdk/go/arvados.(*filehandle).Splice(0x0?, 0x0?) Oct 21 14:51:55 crunch-run[1495]: /arvados/sdk/go/arvados/fs_filehandle.go:152 +0x2e Oct 21 14:51:55 crunch-run[1495]: git.arvados.org/arvados.git/sdk/go/arvados.Splice({0x2969e28, 0xc0000be840}, {0xc0000ad4ce, 0xc}, 0xc000524d00?) Oct 21 14:51:55 crunch-run[1495]: /arvados/sdk/go/arvados/fs_base.go:871 +0x1e4 Oct 21 14:51:55 crunch-run[1495]: git.arvados.org/arvados.git/lib/crunchrun.(*copier).copyFromCollection(0xc000712ce8, {0xc0000ad4ce, 0xc}, 0xc000252300?, {0xc0002d8610, 0xb}) Oct 21 14:51:55 crunch-run[1495]: /arvados/lib/crunchrun/copier.go:424 +0x20b Oct 21 14:51:55 crunch-run[1495]: git.arvados.org/arvados.git/lib/crunchrun.(*copier).walkMount(0xc00076ece8, {0xc0000ad4ce, 0xc}, {0xc0000ad4c0, 0x1a}, 0x0?, 0x0) Oct 21 14:51:55 crunch-run[1495]: /arvados/lib/crunchrun/copier.go:375 +0x925 Oct 21 14:51:55 crunch-run[1495]: git.arvados.org/arvados.git/lib/crunchrun.(*copier).walkMountsBelow(0xc00076ece8, {0x0, 0x0}, {0xc0002d96d0, 0xe}) Oct 21 14:51:55 crunch-run[1495]: /arvados/lib/crunchrun/copier.go:454 +0x24f Oct 21 14:51:55 crunch-run[1495]: git.arvados.org/arvados.git/lib/crunchrun.(*copier).walkMount(0xc000712ce8, {0x0, 0x0}, {0xc0002d96d0, 0xe}, 0x3?, 0x1) Oct 21 14:51:55 crunch-run[1495]: /arvados/lib/crunchrun/copier.go:401 +0xc0d Oct 21 14:51:55 crunch-run[1495]: git.arvados.org/arvados.git/lib/crunchrun.(*copier).Copy(0xc00076ece8) Oct 21 14:51:55 crunch-run[1495]: /arvados/lib/crunchrun/copier.go:80 +0x139 Oct 21 14:51:56 crunch-run[1495]: git.arvados.org/arvados.git/lib/crunchrun.(*ContainerRunner).CaptureOutput(0xc00001c300, 0xc0003863c0?) Oct 21 14:51:56 crunch-run[1495]: /arvados/lib/crunchrun/crunchrun.go:1336 +0x27c Oct 21 14:51:56 crunch-run[1495]: git.arvados.org/arvados.git/lib/crunchrun.(*ContainerRunner).Run.func2() Oct 21 14:51:56 crunch-run[1495]: /arvados/lib/crunchrun/crunchrun.go:1679 +0xf9 Oct 21 14:51:56 crunch-run[1495]: git.arvados.org/arvados.git/lib/crunchrun.(*ContainerRunner).Run(0xc00001c300) Oct 21 14:51:56 crunch-run[1495]: /arvados/lib/crunchrun/crunchrun.go:1765 +0x1007 Oct 21 14:51:56 crunch-run[1495]: git.arvados.org/arvados.git/lib/crunchrun.command.RunCommand({}, {0x7ffc7d645dae, 0x43}, {0xc000106190, 0x5, 0x5}, {0x2932aa0?, 0xc000130018}, {0x2932ac0, 0xc000130020}, ...) Oct 21 14:51:56 crunch-run[1495]: /arvados/lib/crunchrun/crunchrun.go:2101 +0x29ff Oct 21 14:51:56 crunch-run[1495]: git.arvados.org/arvados.git/lib/cmd.Multi.RunCommand(0x47415a?, {0x7ffc7d645dae, 0x43}, {0xc000106190?, 0x5, 0x5}, {0x2932aa0, 0xc000130018}, {0x2932ac0, 0xc000130020}, ...) Oct 21 14:51:56 crunch-run[1495]: /arvados/lib/cmd/cmd.go:105 +0x2da Oct 21 14:51:56 crunch-run[1495]: main.main() Oct 21 14:51:56 crunch-run[1495]: /arvados/cmd/arvados-server/cmd.go:73 +0x8e
Updated by Brett Smith 2 months ago
- Subject changed from crunch-run segfault+panic when finishing up a compute job to crunch-run segfault+panic when finishing up an S3 downloader compute job
Updated by Brett Smith 2 months ago
- Assigned To set to Brett Smith
- Status changed from New to In Progress
Updated by Brett Smith 2 months ago
- Related to Bug #21891: Very slow copying from collection into output in crunch-run copy phase added
Updated by Brett Smith 2 months ago
The failed process used a single S3 URL to an entire bucket as the input. On reviewing the source I don't believe this is correct use of the workflow. A run that used two S3 file URLs as input succeeded (see process).
So, the issue might be triggered by an empty output directory, possibly Keep-backed? I still think we shouldn't crash in this case but it's a clue.
Updated by Brett Smith 2 months ago
Brett Smith wrote in #note-5:
So, the issue might be triggered by an empty output directory, possibly Keep-backed?
Perhaps even more specifically: the output directory only contains other mounts. (In the case of the S3 downloader, that's download.sh
and .aws/config
from an intermediate collection.)
Updated by Brett Smith 2 months ago
ee21f342b1ab0152fbe5030dc39ed1fc5cc2ec1b adds a test to reproduce the problem. TestOutputCollectionWithOnlySubmounts
panics with the exact same stack trace as seen in the logs.
It does not crash if .arvados#collection
has an empty manifest_text
. So that's really key to the bug. It's less clear what a proper fix would be.
Updated by Brett Smith 2 months ago
It's possible there are multiple bugs here.
Here's a run that crashed. This is the run that produced the logs above. It was launched through Workbench's "Run a workflow" interface.
Here's a run that succeeded. It was launched by me running arvados-cwl-runner in my development environment. It doesn't try to copy the file mounted at the top level of the output collection:
2024-10-21T18:40:28.065539182Z copying "*" 2024-10-21T18:40:28.071408808Z copying ".aws/config" from 88d703c507f2a67d364f465541bc1904+112/.aws/config 2024-10-21T18:40:28.104926517Z not copying "download.sh" because contents cannot match output globs
And sure enough, the successful run has an output_glob
that corresponds to the provided S3 URLs:
"output_glob": [ "aws-s3-bulk-download.cwl", "aws-s3-bulk-download.cwl/**", "aws-s3-download.cwl", "aws-s3-download.cwl/**", "cwl.output.json" ]
Whereas the failed run has an empty output_glob
.
Need to investigate why that discrepancy exists. It may be that tordo runs a-c-r
inside a Docker container with an older version that doesn't generate output_glob
.
We may still want to fix the crash in crunch-run anyway, since it seems like the user could always write an output_glob
that would include a mounted file in this same way.
Updated by Brett Smith 2 months ago
Brett Smith wrote in #note-9:
Need to investigate why that discrepancy exists. It may be that tordo runs
a-c-r
inside a Docker container with an older version that doesn't generateoutput_glob
.
That's not it, from the failed run logs:
2024-10-21T14:48:57.432384841Z INFO /usr/lib/python3-arvados-cwl-runner/bin/arvados-cwl-runner 2.8.0.dev20241008154607, arvados-python-client 2.8.0.dev20241008144825, cwltool 3.1.20240508115724That's recent enough to generate
output_glob
.
Whereas the failed run has an empty
output_glob
.
It's probably actually the difference in inputs. Since the failed run has a single S3 URL that tries to download an entire bucket, the output_glob
reduces to the empty set.
Updated by Brett Smith 2 months ago
- Subject changed from crunch-run segfault+panic when finishing up an S3 downloader compute job to crunch-run segfault+panic when output directory includes a file mounted from a collection
Updated by Brett Smith 2 months ago
Branch for discussion of requirements and impact check, it may be none of this gets merged as written: 22226-empty-output-panic @ 5bacea932060f175fd13a24ee909b88844dd3f17 - developer-run-tests: #4520 (I assume the Workbench tests failure is not related but maybe I should dig in)
Updated by Tom Clegg 2 months ago
- Assigned To changed from Brett Smith to Tom Clegg
22226-empty-output-panic @ 7c16d2d0e821d73f791b247499cd79923aaab4bd -- developer-run-tests: #4522
Keeps the test case added by Brett above, and makes it pass by fixing a "stale parent pointer" bug in the Splice implementation. Details in commit message.
Updated by Peter Amstutz 2 months ago
- Target version changed from Development 2024-10-23 sprint to Development 2024-11-06 sprint
Updated by Brett Smith 2 months ago
Tom Clegg wrote in #note-13:
22226-empty-output-panic @ 7c16d2d0e821d73f791b247499cd79923aaab4bd -- developer-run-tests: #4522
Keeps the test case added by Brett above, and makes it pass by fixing a "stale parent pointer" bug in the Splice implementation. Details in commit message.
Question about keeping the test case: at standup we talked about how the mounted file should be copied to the output collection. Does that mean the asserts on the copier should be updated to check that that's happening? Or am I misunderstanding the mechanism here?
If I cleaned up the test to use fixtures from arvadostest, would you take that change?
I realize nobody else will ever care about this but I'd appreciate if you were listed as the author of the most recent commit instead of me, since you wrote literally the entire thing.
Thanks.
Updated by Tom Clegg 2 months ago
Brett Smith wrote in #note-15:
Question about keeping the test case: at standup we talked about how the mounted file should be copied to the output collection. Does that mean the asserts on the copier should be updated to check that that's happening? Or am I misunderstanding the mechanism here?
Yes, I've updated the test case to confirm the mounted file gets included in the output.
If I cleaned up the test to use fixtures from arvadostest, would you take that change?
I've updated it to use arvadostest.FooCollectionPDH.
I realize nobody else will ever care about this but I'd appreciate if you were listed as the author of the most recent commit instead of me, since you wrote literally the entire thing.
Oops, I squashed commits and forgot that it would keep the author from the first commit. Fixed.
22226-empty-output-panic @ 3ada7fc03565cffdfe660ad553e3aaca2b02d504 -- developer-run-tests: #4525
Updated by Brett Smith 2 months ago
Tom Clegg wrote in #note-16:
22226-empty-output-panic @ 3ada7fc03565cffdfe660ad553e3aaca2b02d504 -- developer-run-tests: #4525
So, I believe the test as written tests the thing it's intending to test. The comments help a lot with that, so thank you for those. However, I think it's also a little confusing to follow, because it sets up a situation that can't happen in reality by writing an output file that then gets mounted over. Real compute jobs don't get a chance to execute any code in between all the different mount setups.
The only reason I wrote the foo
file initially was to make the environment as close as possible to the "real world" conditions triggering the bug. If I'm following right, with your new asserts, we don't need to write it at all: we can just test that the existence of the mount record means foo
gets staged.
If I have all that right, I'd suggest deleting all the foo
-writing code as a readability improvement. I think it's easier to follow "foo
gets mounted and then gets staged" as opposed to playing the shell game of figuring out which foo
is the real foo
.
Either way, assuming tests pass this is good to merge. Thanks.
Updated by Tom Clegg 2 months ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|e554a3d040827102d7d193fde7b8a42a0a09abdb.