Bug #22763
Closed
Figure out why WGS-processing chr19 crashes crunch-run 3.1.0
Description
Passing 3.1.0~rc3 workflow: pirca-xvhdp-nwpd7hlkz2n235m
Failing 3.1.1~rc2 workflow: pirca-xvhdp-cv6l0mpe0vz4541 (and others in the same folder, all failing roughly the same way)
Updated by Brett Smith 10 days ago
Observations so far:
- Both workflows have the same Git commit and empty status in metadata (3.1.1 has a different gitDescribe, I think just because I didn't fetch all tags or something)
- Both workflows have the same input (confirmed by using Workbench's copy JSON functionality, pasting them to separate files, and diffing in the terminal)
- Both workflows are running the "previous" version of crunch-run: the 3.1.0rc workflow uses crunch-run 3.0.0, while 3.1.1rc workflow uses crunch-run 3.1.0
- I don't have an explanation for this because at this point in the release process, there should be an rc version of crunch-run on the dispatch node, and with our configuration I believe it should be deploying itself to each compute node.
- The 3.1.0rc workflow uses arv-mount 3.0.0
- This is sort of a gap/bug in our release process: when we build a new compute node, that's not using the testing repository, so it picks up the last release version instead.
- The 3.1.1rc workflow is also using arv-mount 3.0.0
- I didn't build a new compute node, because 3.1.1 doesn't have any changes that should "end up" on one
- The 3.1.1rc workflow is run with --disable-reuse. The 3.1.0rc workflow apparently wasn't: a lot of workflow steps are listed as reused.
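The input comparison described above (copy each workflow's input JSON, then diff) can also be done in a way that ignores key order. This is a minimal sketch, not part of any Arvados tooling; `diff_inputs` is a hypothetical helper name, and it assumes each input is a single top-level JSON object.

```python
import json

# Sentinel so that a key missing on one side is distinguished from a null value.
_MISSING = object()


def diff_inputs(a_text, b_text):
    """Return the sorted top-level keys whose values differ between two JSON objects.

    Hypothetical helper: assumes both arguments are JSON text holding an object.
    """
    a = json.loads(a_text)
    b = json.loads(b_text)
    return sorted(
        key
        for key in a.keys() | b.keys()
        if a.get(key, _MISSING) != b.get(key, _MISSING)
    )
```

An empty list means the two inputs are identical regardless of how the keys were ordered when copied.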
Updated by Brett Smith 10 days ago
The 3.1.1rc workflow is run with --disable-reuse. The 3.1.0rc workflow apparently wasn't: a lot of workflow steps are listed as reused.
The last time we actually ran mark-duplicates was for 3.0.0~rc3. See pirca-xvhdp-vltlwwug3vfceqb and pirca-xvhdp-qk9ox6lq70s3ml7. Thankfully this ran with the correct crunch-run and arv-mount. But note it was workflow commit cf002b3d9d3149ced3d710dc343300e67568239b. Since then:
% git log --oneline cf002b3...
e4d896f (HEAD -> main, origin/main, origin/HEAD) More description of WGS workflow
835bc23 Improve guidance on WGS a bit.
8a369be fix qcreport type
c45e3be Improve WGS demo doc strings a bit.
4188b17 Merge branch 'wgs-no-keep-cache'
019c5b5 Remove keep_cache parameters because they are no longer needed
1acfef8 Bump fastq version
7bd8362 fix fastqc
I suspect the problem is one of the following:
- A bug introduced in crunch-run 3.1.0 that slipped under the radar because we reused results from 3.0.0.
- A bug introduced in the workflow in one of the commits above that is masked by container reuse.
I am running the workflow on jutro, which is still running 3.1.0. If we see the problem occur there, it confirms that the problem is one of the above and that 3.1.1 is not the issue: jutro-xvhdp-ou1ttl3lwwsw3y3
Updated by Brett Smith 10 days ago
And to come at this from the other direction, here's the older version of the workflow (cf002b3d) running on 3.1.1rc2: pirca-xvhdp-42jcu9ldgpp31an
Updated by Brett Smith 10 days ago
- Blocks Support #22718: Release Arvados 3.1.1 added
Updated by Brett Smith 10 days ago
Brett Smith wrote in #note-2:
The 3.1.1rc workflow is run with --disable-reuse. The 3.1.0rc workflow apparently wasn't: a lot of workflow steps are listed as reused.
The last time we actually ran mark-duplicates was for 3.0.0~rc3. See pirca-xvhdp-vltlwwug3vfceqb and pirca-xvhdp-qk9ox6lq70s3ml7. Thankfully this ran with the correct crunch-run and arv-mount. But note it was workflow commit cf002b3d9d3149ced3d710dc343300e67568239b. Since then:
[...]
I suspect the problem is one of the following:
- A bug introduced in crunch-run 3.1.0 that slipped under the radar because we reused results from 3.0.0.
- A bug introduced in the workflow in one of the commits above that is masked by container reuse.
jutro-xvhdp-ou1ttl3lwwsw3y3: Current main workflow running on 3.1.0, failed
pirca-xvhdp-42jcu9ldgpp31an: Previously-successful workflow running on 3.1.1rc2, failed
I think a bug in crunch-run or maybe a-c-r 3.1.0 is the most likely explanation.
Updated by Brett Smith 10 days ago
- Related to Bug #22617: Compute node images don't ensure encrypted partitions because of cycle with docker.socket added
Updated by Peter Amstutz 9 days ago
2025-04-08T21:58:52.410102783Z crunch-run 3.1.0 (go1.23.6) started
2025-04-08T21:58:52.410406930Z crunch-run process has uid=0(root) gid=0(root) groups=0(root)
2025-04-08T21:58:53.136835237Z Using FUSE mount: /usr/bin/arv-mount 3.0.0
I would expect this to be at least crunch-run 3.1.1rc2 (because arvados-dispatch-cloud was updated) and arv-mount 3.1.0 (because the compute AMI was updated).
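A check like this could be scripted when validating a release candidate. The following is a minimal sketch that pulls the crunch-run and arv-mount versions out of a container log; the regular expressions are assumptions based only on the three log lines quoted above, and `component_versions` is a hypothetical helper, not an Arvados API.

```python
import re

# Patterns inferred from the quoted log lines; the real log format may vary.
CRUNCH_RUN_RE = re.compile(r"crunch-run (\S+) \(go[^)]*\) started")
ARV_MOUNT_RE = re.compile(r"Using FUSE mount: \S*arv-mount (\S+)")


def component_versions(log_text):
    """Return a dict of the component versions reported in a crunch-run log."""
    versions = {}
    for line in log_text.splitlines():
        match = CRUNCH_RUN_RE.search(line)
        if match:
            versions["crunch-run"] = match.group(1)
        match = ARV_MOUNT_RE.search(line)
        if match:
            versions["arv-mount"] = match.group(1)
    return versions
```

Comparing the returned versions against the release candidate under test would flag a run like this one before its results are treated as a valid test of the candidate.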
Updated by Peter Amstutz 9 days ago
You're right that this is actually a 3.1.0 bug, and it wasn't caught in the 3.1.0 release because I forgot to add --disable-reuse. So that is on me. I'm glad we caught it and are tracking it down now.
Updated by Peter Amstutz 9 days ago
It shouldn't be possible for a "user process" like arvados-cwl-runner to crash crunch-run or arvados-dispatch-cloud like this, so even if it was feeding bad information somehow, the bug would still be that a system component like crunch-run crashed.
Updated by Brett Smith 9 days ago
- Subject changed from Figure out why WGS-processing chr19 is failing on 3.1.1rc2 to Figure out why WGS-processing chr19 crashes crunch-run 3.1.0
Updated by Brett Smith 9 days ago
Brett Smith wrote in #note-1:
- Both workflows are running the "previous" version of crunch-run: the 3.1.0rc workflow uses crunch-run 3.0.0, while 3.1.1rc workflow uses crunch-run 3.1.0
- I don't have an explanation for this because at this point in the release process, there should be an rc version of crunch-run on the dispatch node, and with our configuration I believe it should be deploying itself to each compute node.
For 3.1.1rc2 the explanation is #22755. Note in particular that #22755#note-5 says that 3.1.1rc2 has been deployed to pirca, but I don't see any evidence of that. See follow-ups on that ticket.
Updated by Brett Smith 9 days ago
Brett Smith wrote in #note-1:
- Both workflows are running the "previous" version of crunch-run: the 3.1.0rc workflow uses crunch-run 3.0.0, while 3.1.1rc workflow uses crunch-run 3.1.0
- I don't have an explanation for this because at this point in the release process, there should be an rc version of crunch-run on the dispatch node, and with our configuration I believe it should be deploying itself to each compute node.
This happened because of a bug in our Salt configuration. See #22766.
Updated by Brett Smith 9 days ago
Brett Smith wrote in #note-1:
- The 3.1.0rc workflow uses arv-mount 3.0.0
- This is sort of a gap/bug in our release process: when we build a new compute node, that's not using the testing repository, so it picks up the last release version instead.
This was not quite correct. The Jenkins job to build compute nodes does correctly select the testing repository. However, it still used the default package pins, so it selected the previous stable release from that repository. I have updated the job to disable Arvados package pins for these clusters.
Updated by Brett Smith 9 days ago
- Status changed from In Progress to Closed
At this point we understand why the test workflow "passed" for 3.1.0, fails on 3.1.1, and we have split out several tickets to address those root causes. I am going to close this and continue investigation for the cause of the crash itself on the original ticket #22617 just so we don't have to keep looking for updates in two places.