Bug #22763

closed

Figure out why WGS-processing chr19 crashes crunch-run 3.1.0

Added by Brett Smith 10 days ago. Updated 8 days ago.

Status: Closed
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Story points: -
Release relationship: Auto

Description

Passing 3.1.0~rc3 workflow: pirca-xvhdp-nwpd7hlkz2n235m

Failing 3.1.1~rc2 workflow: pirca-xvhdp-cv6l0mpe0vz4541 (and others in the same folder, all failing roughly the same way)


Related issues 2 (0 open, 2 closed)

Related to Arvados - Bug #22617: Compute node images don't ensure encrypted partitions because of cycle with docker.socket (Resolved, Brett Smith)
Blocks Arvados - Support #22718: Release Arvados 3.1.1 (Resolved, Brett Smith)
Actions #1

Updated by Brett Smith 10 days ago

Observations so far:

  • Both workflows have the same Git commit and empty status in metadata (3.1.1 has a different gitDescribe, I think just because I didn't fetch all tags or something)
  • Both workflows have the same input (confirmed by using Workbench's copy JSON functionality, pasting them to separate files, and diffing in the terminal)
  • Both workflows are running the "previous" version of crunch-run: the 3.1.0rc workflow uses crunch-run 3.0.0, while 3.1.1rc workflow uses crunch-run 3.1.0
    • I don't have an explanation for this because at this point in the release process, there should be an rc version of crunch-run on the dispatch node, and with our configuration I believe it should be deploying itself to each compute node.
  • The 3.1.0rc workflow uses arv-mount 3.0.0
    • This is sort of a gap/bug in our release process: when we build a new compute node, that's not using the testing repository, so it picks up the last release version instead.
  • The 3.1.1rc workflow is also using arv-mount 3.0.0
    • That's because I didn't build a new compute node, since 3.1.1 doesn't have any changes that should "end up" on one
  • The 3.1.1rc workflow is run with --disable-reuse. The 3.1.0rc workflow apparently wasn't: a lot of workflow steps are listed as reused.
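
The input comparison above (copy each workflow's input JSON to a file, then diff) could also be scripted. What follows is a minimal sketch, not what was actually run: the function name `diff_json_inputs` and the default labels are hypothetical, and it only assumes the two inputs are available as JSON text. Normalizing key order and indentation first means the diff only shows real differences in values.

```python
import difflib
import json


def diff_json_inputs(a_text, b_text, a_label="a.json", b_label="b.json"):
    """Return unified-diff lines between two JSON documents.

    Both documents are re-serialized with sorted keys and fixed indentation
    so that key ordering and whitespace do not show up as differences.
    """
    a_lines = json.dumps(json.loads(a_text), indent=2, sort_keys=True).splitlines()
    b_lines = json.dumps(json.loads(b_text), indent=2, sort_keys=True).splitlines()
    return list(
        difflib.unified_diff(
            a_lines, b_lines, fromfile=a_label, tofile=b_label, lineterm=""
        )
    )
```

An empty list means the two inputs are equivalent. To use it on saved files, read each file's contents and pass the two strings in.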
Actions #2

Updated by Brett Smith 10 days ago

The 3.1.1rc workflow is run with --disable-reuse. The 3.1.0rc workflow apparently wasn't: a lot of workflow steps are listed as reused.

The last time we actually ran mark-duplicates was for 3.0.0~rc3. See pirca-xvhdp-vltlwwug3vfceqb and pirca-xvhdp-qk9ox6lq70s3ml7. Thankfully this ran with the correct crunch-run and arv-mount. But note it was workflow commit cf002b3d9d3149ced3d710dc343300e67568239b. Since then:

% git log --oneline cf002b3...
e4d896f (HEAD -> main, origin/main, origin/HEAD) More description of WGS workflow
835bc23 Improve guidance on WGS a bit.
8a369be fix qcreport type
c45e3be Improve WGS demo doc strings a bit.
4188b17 Merge branch 'wgs-no-keep-cache'
019c5b5 Remove keep_cache parameters because they are no longer needed
1acfef8 Bump fastq version
7bd8362 fix fastqc

I suspect the problem is one of the following:

  • A bug introduced in crunch-run 3.1.0 that slipped under the radar because we reused results from 3.0.0.
  • A bug introduced in the workflow in one of the commits above that is masked by container reuse.

I am running the workflow on jutro, which is still running 3.1.0. If the problem occurs there too, that shows the problem is one of the above and proves that 3.1.1 is not the issue. jutro-xvhdp-ou1ttl3lwwsw3y3

Actions #3

Updated by Brett Smith 10 days ago

And to come at this from the other direction, here's the older version of the workflow (cf002b3d) running on 3.1.1rc2: pirca-xvhdp-42jcu9ldgpp31an

Actions #4

Updated by Brett Smith 10 days ago

Actions #5

Updated by Brett Smith 10 days ago

Brett Smith wrote in #note-2:

The 3.1.1rc workflow is run with --disable-reuse. The 3.1.0rc workflow apparently wasn't: a lot of workflow steps are listed as reused.

The last time we actually ran mark-duplicates was for 3.0.0~rc3. See pirca-xvhdp-vltlwwug3vfceqb and pirca-xvhdp-qk9ox6lq70s3ml7. Thankfully this ran with the correct crunch-run and arv-mount. But note it was workflow commit cf002b3d9d3149ced3d710dc343300e67568239b. Since then:

[...]

I suspect the problem is one of the following:

  • A bug introduced in crunch-run 3.1.0 that slipped under the radar because we reused results from 3.0.0.
  • A bug introduced in the workflow in one of the commits above that is masked by container reuse.

jutro-xvhdp-ou1ttl3lwwsw3y3: Current main workflow running on 3.1.0, failed
pirca-xvhdp-42jcu9ldgpp31an: Previously-successful workflow running on 3.1.1rc2, failed

I think a bug in crunch-run or maybe a-c-r 3.1.0 is the most likely explanation.

Actions #6

Updated by Brett Smith 10 days ago

  • Related to Bug #22617: Compute node images don't ensure encrypted partitions because of cycle with docker.socket added
Actions #7

Updated by Peter Amstutz 9 days ago

2025-04-08T21:58:52.410102783Z crunch-run 3.1.0 (go1.23.6) started

2025-04-08T21:58:52.410406930Z crunch-run process has uid=0(root) gid=0(root) groups=0(root)

2025-04-08T21:58:53.136835237Z Using FUSE mount: /usr/bin/arv-mount 3.0.0

I would expect this to be at least crunch-run 3.1.1rc2 (because arvados-dispatch-cloud was updated) and arv-mount 3.1.0 (because the compute AMI was updated).
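
Checking these versions by eye is easy to get wrong across many containers, so here is a minimal sketch of pulling them out of a crunch-run log. It assumes only the two line formats quoted above; the function name `component_versions` and the regular expressions are hypothetical, not part of any Arvados tool, and other log formats would need their own patterns.

```python
import re

# Patterns are modeled on the log lines quoted in this note (an assumption:
# other crunch-run versions may word these lines differently).
CRUNCH_RUN_RE = re.compile(r"crunch-run (\S+) \(go[\d.]+\) started")
ARV_MOUNT_RE = re.compile(r"Using FUSE mount: \S*arv-mount (\S+)")


def component_versions(log_text):
    """Return a dict of the component versions reported in a crunch-run log."""
    versions = {}
    for line in log_text.splitlines():
        match = CRUNCH_RUN_RE.search(line)
        if match:
            versions["crunch-run"] = match.group(1)
        match = ARV_MOUNT_RE.search(line)
        if match:
            versions["arv-mount"] = match.group(1)
    return versions
```

Run against the three lines above, this reports crunch-run 3.1.0 and arv-mount 3.0.0, which can then be compared with the versions the release process was supposed to deploy.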

Actions #8

Updated by Peter Amstutz 9 days ago

You're right that this is actually a 3.1.0 bug, and it wasn't caught in the 3.1.0 release because I forgot to add --disable-reuse. So that is on me. I'm glad we caught it and are tracking it down now.

Actions #9

Updated by Peter Amstutz 9 days ago

It shouldn't be possible for a "user process" like arvados-cwl-runner to crash crunch-run or arvados-dispatch-cloud like this, so even if it was feeding bad information somehow, the bug would still be that a system component like crunch-run crashed.

Actions #10

Updated by Brett Smith 9 days ago

Peter Amstutz wrote in #note-6:

I would expect this to be at least crunch-run 3.1.1rc2 (because arvados-dispatch-cloud was updated) and arv-mount 3.1.0 (because the compute AMI was updated).

I agree, see #note-1. I understand why arv-mount isn't updated but crunch-run is a mystery to me right now.

Actions #11

Updated by Brett Smith 9 days ago

  • Subject changed from Figure out why WGS-processing chr19 is failing on 3.1.1rc2 to Figure out why WGS-processing chr19 crashes crunch-run 3.1.0
Actions #12

Updated by Brett Smith 9 days ago

Brett Smith wrote in #note-1:

  • Both workflows are running the "previous" version of crunch-run: the 3.1.0rc workflow uses crunch-run 3.0.0, while 3.1.1rc workflow uses crunch-run 3.1.0
    • I don't have an explanation for this because at this point in the release process, there should be an rc version of crunch-run on the dispatch node, and with our configuration I believe it should be deploying itself to each compute node.

For 3.1.1rc2 the explanation is #22755. Note in particular that #22755#note-5 says that 3.1.1rc2 has been deployed to pirca, but I don't see any evidence of that. See follow-ups on that ticket.

Actions #13

Updated by Brett Smith 9 days ago

Brett Smith wrote in #note-1:

  • Both workflows are running the "previous" version of crunch-run: the 3.1.0rc workflow uses crunch-run 3.0.0, while 3.1.1rc workflow uses crunch-run 3.1.0
    • I don't have an explanation for this because at this point in the release process, there should be an rc version of crunch-run on the dispatch node, and with our configuration I believe it should be deploying itself to each compute node.

This happened because of a bug in our Salt configuration. See #22766.

Actions #14

Updated by Brett Smith 9 days ago

Brett Smith wrote in #note-1:

  • The 3.1.0rc workflow uses arv-mount 3.0.0
    • This is sort of a gap/bug in our release process: when we build a new compute node, that's not using the testing repository, so it picks up the last release version instead.

This was not quite correct. The Jenkins job to build compute nodes does correctly select the testing repository. However, it still used the default package pins, so it selected the previous stable release from that repository. I have updated the job to disable Arvados package pins for these clusters.

Actions #15

Updated by Brett Smith 9 days ago

  • Status changed from In Progress to Closed

At this point we understand why the test workflow "passed" for 3.1.0, fails on 3.1.1, and we have split out several tickets to address those root causes. I am going to close this and continue investigation for the cause of the crash itself on the original ticket #22617 just so we don't have to keep looking for updates in two places.
