Task #22512
Support #22510: Release Arvados 3.1.0 (closed)
2. Build and test new tordo compute node, update third-party package pin versions
Updated by Brett Smith 2 months ago
- Assigned To set to Brett Smith
- Status changed from New to In Progress
Updated by Brett Smith 2 months ago
New image: packer-build-compute-image: #305 - ami-08602fed623133256
Test run: tordo-xvhdp-asvj8h83ke63sh0
Updated by Brett Smith 2 months ago
Brett Smith wrote in #note-2:
New image: packer-build-compute-image: #305 - ami-08602fed623133256
This test is bad because it pinned the packages. We want to see the "Install <software> package pins" tasks skipped. Added one configuration line to Jenkins to do that.
Now trying packer-build-compute-image: #306 but it seems to be stuck waiting for an executor. I'm just gonna leave it for now. If there's no useful progress by morning I'll have to dig in.
Updated by Brett Smith 2 months ago
Brett Smith wrote in #note-3:
Now trying packer-build-compute-image: #306
Package list attached. No major changes except maybe CUDA. WGS workflow: tordo-xvhdp-6nh5z156inxcdkr
Updated by Brett Smith 2 months ago
Brett Smith wrote in #note-4:
Package list attached. No major changes except maybe CUDA. WGS workflow: tordo-xvhdp-6nh5z156inxcdkr
No really this time. The workflow failed but it seems like a workflow bug or possibly #22466. Changes:
cuda 12.5 → 12.6
nvidia-container 1.16 → 1.17
Do we have a way to actually test these? Do we want to?
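A change list like the one above can be derived by diffing the old and new package lists. This is a minimal sketch, assuming each list is plain text with one `name version` pair per line. That format is hypothetical (the attached list's actual format isn't shown here), and `parse_package_list`, `diff_packages`, and `example-pkg` are illustrative names only.

```python
def parse_package_list(text):
    """Parse 'name version' lines into a dict.

    Assumed format (hypothetical): one package per line, name and version
    separated by whitespace. Blank lines, '#' comments, and lines without
    a version are skipped.
    """
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        name, version = parts
        packages[name] = version.strip()
    return packages


def diff_packages(old, new):
    """Return {name: (old_version, new_version)} for packages that differ.

    A package present on only one side is reported with None on the other.
    """
    changes = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes


if __name__ == "__main__":
    # Illustrative data: the two version bumps from this note, plus one
    # made-up unchanged package.
    old_list = "cuda 12.5\nnvidia-container 1.16\nexample-pkg 1.0\n"
    new_list = "cuda 12.6\nnvidia-container 1.17\nexample-pkg 1.0\n"
    changes = diff_packages(
        parse_package_list(old_list), parse_package_list(new_list)
    )
    for name, (before, after) in changes.items():
        print(f"{name} {before} → {after}")
```

This only shows which pins moved. It says nothing about whether the new CUDA or nvidia-container versions actually work on a compute node.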
Updated by Peter Amstutz 2 months ago
Brett Smith wrote in #note-5:
Brett Smith wrote in #note-4:
Package list attached. No major changes except maybe CUDA. WGS workflow: tordo-xvhdp-6nh5z156inxcdkr
No really this time. The workflow failed but it seems like a workflow bug or possibly #22466. Changes:
cuda 12.5 → 12.6
nvidia-container 1.16 → 1.17
Do we have a way to actually test these? Do we want to?
That workflow doesn't use CUDA.
These are failing with "Cancelled after exceeding MaxDispatchAttempts" which makes me think we're getting intermittently broken compute nodes.
Updated by Brett Smith 2 months ago
Peter Amstutz wrote in #note-6:
Brett Smith wrote in #note-5:
cuda 12.5 → 12.6
nvidia-container 1.16 → 1.17
Do we have a way to actually test these? Do we want to?
That workflow doesn't use CUDA.
Yes I know, hence my questions.
These are failing with "Cancelled after exceeding MaxDispatchAttempts" which makes me think we're getting intermittently broken compute nodes.
Given that the image build itself took over an hour to even get started, I suspect there was some cloud weather last night. Let's just retry for starters: tordo-xvhdp-dtcx4b0fvc0gni0
Updated by Brett Smith 29 days ago
- Remaining (hours) set to 0.0
- Status changed from In Progress to Resolved
Pins got updated in #22612, specifically 8b220be113f6abc0b15691371a0ac34de4c0076a. A CUDA test specifically is written up in #22612#note-6.