Project

General

Profile

Actions

Bug #22226

closed

crunch-run segfault+panic when output directory includes a file mounted from a collection

Added by Brett Smith 7 days ago. Updated 4 days ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Story points:
-
Release:
Release relationship:
Auto

Description

I have the full log if needed but here's the important bits. Worth noting, this is on a test compute node with Docker 27. The same compute image works with Singularity okay.

Oct 21 14:51:18 crunch-run[1495]: crunch-run 2.8.0~dev20241015141800 (go1.21.10) started
…
Oct 21 14:51:55 crunch-run[1495]: tordo-dz642-84o7gwkioo1nvn9 2024-10-21T14:51:55.242438032Z copying "*" 
Oct 21 14:51:55 crunch-run[1495]: tordo-dz642-84o7gwkioo1nvn9 2024-10-21T14:51:55.245246755Z copying ".aws/config" from c80031b757ffc50684d0b8cbe5282587+112/.aws/config
Oct 21 14:51:55 crunch-run[1495]: tordo-dz642-84o7gwkioo1nvn9 2024-10-21T14:51:55.269394051Z copying "download.sh" from c80031b757ffc50684d0b8cbe5282587+112/download.sh
Oct 21 14:51:55 crunch-run[1495]: tordo-dz642-84o7gwkioo1nvn9 2024-10-21T14:51:55.271639670Z Running [arv-mount --unmount-timeout=8 --unmount /tmp/crunch-run.tordo-dz642-84o7gwkioo1nvn9.3147090036/keep1196122805]
Oct 21 14:51:55 crunch-run[1495]: tordo-dz642-84o7gwkioo1nvn9 2024-10-21T14:51:55.981973623Z crunch-run finished
Oct 21 14:51:55 crunch-run[1495]: panic: runtime error: invalid memory address or nil pointer dereference
Oct 21 14:51:55 crunch-run[1495]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x9581e9]
Oct 21 14:51:55 crunch-run[1495]: goroutine 1 [running]:
Oct 21 14:51:55 crunch-run[1495]: git.arvados.org/arvados.git/sdk/go/arvados.(*dirnode).Child(0xc0000a4900, {0xc0007b8570, 0xb}, 0x0?)
Oct 21 14:51:55 crunch-run[1495]:         /arvados/sdk/go/arvados/fs_collection.go:993 +0x29
Oct 21 14:51:55 crunch-run[1495]: git.arvados.org/arvados.git/sdk/go/arvados.(*filenode).Splice(0xc0000b2e60, {0x2969988?, 0xc000444e60?})
Oct 21 14:51:55 crunch-run[1495]:         /arvados/sdk/go/arvados/fs_collection.go:964 +0x168
Oct 21 14:51:55 crunch-run[1495]: git.arvados.org/arvados.git/sdk/go/arvados.(*filehandle).Splice(0x0?, 0x0?)
Oct 21 14:51:55 crunch-run[1495]:         /arvados/sdk/go/arvados/fs_filehandle.go:152 +0x2e
Oct 21 14:51:55 crunch-run[1495]: git.arvados.org/arvados.git/sdk/go/arvados.Splice({0x2969e28, 0xc0000be840}, {0xc0000ad4ce, 0xc}, 0xc000524d00?)
Oct 21 14:51:55 crunch-run[1495]:         /arvados/sdk/go/arvados/fs_base.go:871 +0x1e4
Oct 21 14:51:55 crunch-run[1495]: git.arvados.org/arvados.git/lib/crunchrun.(*copier).copyFromCollection(0xc000712ce8, {0xc0000ad4ce, 0xc}, 0xc000252300?, {0xc0002d8610, 0xb})
Oct 21 14:51:55 crunch-run[1495]:         /arvados/lib/crunchrun/copier.go:424 +0x20b
Oct 21 14:51:55 crunch-run[1495]: git.arvados.org/arvados.git/lib/crunchrun.(*copier).walkMount(0xc00076ece8, {0xc0000ad4ce, 0xc}, {0xc0000ad4c0, 0x1a}, 0x0?, 0x0)
Oct 21 14:51:55 crunch-run[1495]:         /arvados/lib/crunchrun/copier.go:375 +0x925
Oct 21 14:51:55 crunch-run[1495]: git.arvados.org/arvados.git/lib/crunchrun.(*copier).walkMountsBelow(0xc00076ece8, {0x0, 0x0}, {0xc0002d96d0, 0xe})
Oct 21 14:51:55 crunch-run[1495]:         /arvados/lib/crunchrun/copier.go:454 +0x24f
Oct 21 14:51:55 crunch-run[1495]: git.arvados.org/arvados.git/lib/crunchrun.(*copier).walkMount(0xc000712ce8, {0x0, 0x0}, {0xc0002d96d0, 0xe}, 0x3?, 0x1)
Oct 21 14:51:55 crunch-run[1495]:         /arvados/lib/crunchrun/copier.go:401 +0xc0d
Oct 21 14:51:55 crunch-run[1495]: git.arvados.org/arvados.git/lib/crunchrun.(*copier).Copy(0xc00076ece8)
Oct 21 14:51:55 crunch-run[1495]:         /arvados/lib/crunchrun/copier.go:80 +0x139
Oct 21 14:51:56 crunch-run[1495]: git.arvados.org/arvados.git/lib/crunchrun.(*ContainerRunner).CaptureOutput(0xc00001c300, 0xc0003863c0?)
Oct 21 14:51:56 crunch-run[1495]:         /arvados/lib/crunchrun/crunchrun.go:1336 +0x27c
Oct 21 14:51:56 crunch-run[1495]: git.arvados.org/arvados.git/lib/crunchrun.(*ContainerRunner).Run.func2()
Oct 21 14:51:56 crunch-run[1495]:         /arvados/lib/crunchrun/crunchrun.go:1679 +0xf9
Oct 21 14:51:56 crunch-run[1495]: git.arvados.org/arvados.git/lib/crunchrun.(*ContainerRunner).Run(0xc00001c300)
Oct 21 14:51:56 crunch-run[1495]:         /arvados/lib/crunchrun/crunchrun.go:1765 +0x1007
Oct 21 14:51:56 crunch-run[1495]: git.arvados.org/arvados.git/lib/crunchrun.command.RunCommand({}, {0x7ffc7d645dae, 0x43}, {0xc000106190, 0x5, 0x5}, {0x2932aa0?, 0xc000130018}, {0x2932ac0, 0xc000130020}, ...)
Oct 21 14:51:56 crunch-run[1495]:         /arvados/lib/crunchrun/crunchrun.go:2101 +0x29ff
Oct 21 14:51:56 crunch-run[1495]: git.arvados.org/arvados.git/lib/cmd.Multi.RunCommand(0x47415a?, {0x7ffc7d645dae, 0x43}, {0xc000106190?, 0x5, 0x5}, {0x2932aa0, 0xc000130018}, {0x2932ac0, 0xc000130020}, ...)
Oct 21 14:51:56 crunch-run[1495]:         /arvados/lib/cmd/cmd.go:105 +0x2da
Oct 21 14:51:56 crunch-run[1495]: main.main()
Oct 21 14:51:56 crunch-run[1495]:         /arvados/cmd/arvados-server/cmd.go:73 +0x8e

Subtasks 1 (1 open0 closed)

Task #22233: Review 22226-empty-output-panicIn ProgressTom Clegg10/23/2024Actions

Related issues

Related to Arvados - Bug #21891: Very slow copying from collection into output in crunch-run copy phaseResolvedTom CleggActions
Actions #1

Updated by Brett Smith 7 days ago

  • Subject changed from crunch-run segfault+panic when finishing up a compute job to crunch-run segfault+panic when finishing up an S3 downloader compute job
Actions #2

Updated by Brett Smith 7 days ago

  • Assigned To set to Brett Smith
  • Status changed from New to In Progress
Actions #3

Updated by Brett Smith 7 days ago

  • Related to Bug #21891: Very slow copying from collection into output in crunch-run copy phase added
Actions #4

Updated by Peter Amstutz 7 days ago

  • Release set to 70
Actions #5

Updated by Brett Smith 7 days ago

The failed process used a single S3 URL to an entire bucket as the input. On reviewing the source I don't believe this is correct use of the workflow. A run that used two S3 file URLs as input succeeded (see process).

So, the issue might be triggered by an empty output directory, possibly Keep-backed? I still think we shouldn't crash in this case but it's a clue.

Actions #6

Updated by Brett Smith 7 days ago

Brett Smith wrote in #note-5:

So, the issue might be triggered by an empty output directory, possibly Keep-backed?

Perhaps even more specifically: the output directory only contains other mounts. (In the case of the S3 downloader, that's download.sh and .aws/config from an intermediate collection.)

Actions #7

Updated by Brett Smith 7 days ago

  • Description updated (diff)
Actions #8

Updated by Brett Smith 7 days ago

ee21f342b1ab0152fbe5030dc39ed1fc5cc2ec1b adds a test to reproduce the problem. TestOutputCollectionWithOnlySubmounts panics with the exact same stack trace as seen in the logs.

It does not crash if .arvados#collection has an empty manifest_text. So that's really key to the bug. It's less clear what a proper fix would be.

Actions #9

Updated by Brett Smith 6 days ago

It's possible there are multiple bugs here.

Here's a run that crashed. This is the run that produced the logs above. It was launched through Workbench's "Run a workflow" interface.

Here's a run that succeeded. It was launched by me running arvados-cwl-runner in my development environment. It doesn't try to copy the file mounted at the top level of the output collection:

2024-10-21T18:40:28.065539182Z copying "*" 
2024-10-21T18:40:28.071408808Z copying ".aws/config" from 88d703c507f2a67d364f465541bc1904+112/.aws/config
2024-10-21T18:40:28.104926517Z not copying "download.sh" because contents cannot match output globs

And sure enough, the successful run has an output_glob that corresponds to the provided S3 URLs:

"output_glob": [
  "aws-s3-bulk-download.cwl",
  "aws-s3-bulk-download.cwl/**",
  "aws-s3-download.cwl",
  "aws-s3-download.cwl/**",
  "cwl.output.json" 
]

Whereas the failed run has an empty output_glob.

Need to investigate why that discrepancy exists. It may be that tordo runs a-c-r inside a Docker container with an older version that doesn't generate output_glob.

We may still want to fix the crash in crunch-run anyway, since it seems like the user could always write an output_glob that would include a mounted file in this same way.

Actions #10

Updated by Brett Smith 6 days ago

Brett Smith wrote in #note-9:

Need to investigate why that discrepancy exists. It may be that tordo runs a-c-r inside a Docker container with an older version that doesn't generate output_glob.

That's not it, from the failed run logs:

2024-10-21T14:48:57.432384841Z INFO /usr/lib/python3-arvados-cwl-runner/bin/arvados-cwl-runner 2.8.0.dev20241008154607, arvados-python-client 2.8.0.dev20241008144825, cwltool 3.1.20240508115724
That's recent enough to generate output_glob.

Whereas the failed run has an empty output_glob.

It's probably actually the difference in inputs. Since the failed run has a single S3 URL that tries to download an entire bucket, the output_glob reduces to the empty set.

Actions #11

Updated by Brett Smith 6 days ago

  • Subject changed from crunch-run segfault+panic when finishing up an S3 downloader compute job to crunch-run segfault+panic when output directory includes a file mounted from a collection
Actions #12

Updated by Brett Smith 6 days ago

Branch for discussion of requirements and impact check, it may be none of this gets merged as written: 22226-empty-output-panic @ 5bacea932060f175fd13a24ee909b88844dd3f17 - developer-run-tests: #4520 (I assume the Workbench tests failure is not related but maybe I should dig in)

Actions #13

Updated by Tom Clegg 5 days ago

  • Assigned To changed from Brett Smith to Tom Clegg

22226-empty-output-panic @ 7c16d2d0e821d73f791b247499cd79923aaab4bd -- developer-run-tests: #4522

Keeps the test case added by Brett above, and makes it pass by fixing a "stale parent pointer" bug in the Splice implementation. Details in commit message.

Actions #14

Updated by Peter Amstutz 5 days ago

  • Target version changed from Development 2024-10-23 sprint to Development 2024-11-06 sprint
Actions #15

Updated by Brett Smith 5 days ago

Tom Clegg wrote in #note-13:

22226-empty-output-panic @ 7c16d2d0e821d73f791b247499cd79923aaab4bd -- developer-run-tests: #4522

Keeps the test case added by Brett above, and makes it pass by fixing a "stale parent pointer" bug in the Splice implementation. Details in commit message.

Question about keeping the test case: at standup we talked about how the mounted file should be copied to the output collection. Does that mean the asserts on the copier should be updated to check that that's happening? Or am I misunderstanding the mechanism here?

If I cleaned up the test to use fixtures from arvadostest, would you take that change?

I realize nobody else will ever care about this but I'd appreciate if you were listed as the author of the most recent commit instead of me, since you wrote literally the entire thing.

Thanks.

Actions #16

Updated by Tom Clegg 5 days ago

Brett Smith wrote in #note-15:

Question about keeping the test case: at standup we talked about how the mounted file should be copied to the output collection. Does that mean the asserts on the copier should be updated to check that that's happening? Or am I misunderstanding the mechanism here?

Yes, I've updated the test case to confirm the mounted file gets included in the output.

If I cleaned up the test to use fixtures from arvadostest, would you take that change?

I've updated it to use arvadostest.FooCollectionPDH.

I realize nobody else will ever care about this but I'd appreciate if you were listed as the author of the most recent commit instead of me, since you wrote literally the entire thing.

Oops, I squashed commits and forgot that it would keep the author from the first commit. Fixed.

22226-empty-output-panic @ 3ada7fc03565cffdfe660ad553e3aaca2b02d504 -- developer-run-tests: #4525

Actions #17

Updated by Brett Smith 5 days ago

Tom Clegg wrote in #note-16:

22226-empty-output-panic @ 3ada7fc03565cffdfe660ad553e3aaca2b02d504 -- developer-run-tests: #4525

So, I believe the test as written tests the thing it's intending to test. The comments help a lot with that, so thank you for those. However, I think it's also a little confusing to follow, because it sets up a situation that can't happen in reality by writing an output file that then gets mounted over. Real compute jobs don't get a chance to execute any code in between all the different mount setups.

The only reason I wrote the foo file initially was to make the environment as close as possible to the "real world" conditions triggering the bug. If I'm following right, with your new asserts, we don't need to write it at all: we can just test that the existence of the mount record means foo gets staged.

If I have all that right, I'd suggest deleting all the foo-writing code as a readability improvement. I think it's easier to follow "foo gets mounted and then gets staged" as opposed to playing the shell game of figuring out which foo is the real foo.

Either way, assuming tests pass this is good to merge. Thanks.

Actions #18

Updated by Tom Clegg 4 days ago

  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF