Bug #18713

closed

[gpu] nvidia-persistenced.service fails when booted on a node without GPUs

Added by Ward Vandewege 6 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 02/03/2022
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release relationship: Auto

Description

The systemd nvidia-persistenced.service fails to start when a compute image with Nvidia GPU support is started on a non-GPU node:

# systemctl
...
● nvidia-persistenced.service                                              loaded failed failed    NVIDIA Persistence Daemon         
...
# systemctl status nvidia-persistenced.service
● nvidia-persistenced.service - NVIDIA Persistence Daemon
   Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; disabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Thu 2022-02-03 18:38:16 UTC; 16min ago

Feb 03 18:38:15 ip-10-253-254-98 systemd[1]: Starting NVIDIA Persistence Daemon...
Feb 03 18:38:15 ip-10-253-254-98 nvidia-persistenced[559]: Started (559)
Feb 03 18:38:16 ip-10-253-254-98 nvidia-persistenced[559]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 108 has read and write permissions for those files.
Feb 03 18:38:16 ip-10-253-254-98 nvidia-persistenced[552]: nvidia-persistenced failed to initialize. Check syslog for more details.
Feb 03 18:38:16 ip-10-253-254-98 nvidia-persistenced[559]: Shutdown (559)
Feb 03 18:38:16 ip-10-253-254-98 systemd[1]: nvidia-persistenced.service: Control process exited, code=exited, status=1/FAILURE
Feb 03 18:38:16 ip-10-253-254-98 systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Feb 03 18:38:16 ip-10-253-254-98 systemd[1]: Failed to start NVIDIA Persistence Daemon.

This is a problem because it means that

systemctl is-system-running

returns degraded. That command is our default for BootProbeCommand. In other words, the compute nodes never reach "ready" state from Arvados' perspective.
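
For illustration, a minimal sketch of how the states printed by `systemctl is-system-running` relate to readiness. The helper name `boot_probe_verdict` and the verdict strings are hypothetical and not part of Arvados. The state names come from systemd, which exits non-zero for every state except `running`.

```shell
# Hypothetical helper: map a state printed by `systemctl is-system-running`
# to a readiness verdict. Not part of Arvados; the names are illustrative.
boot_probe_verdict() {
  case "$1" in
    running) echo ready ;;                  # only state with exit status 0
    initializing|starting) echo booting ;;  # boot still in progress
    *) echo not-ready ;;                    # degraded, maintenance, stopping, offline, unknown
  esac
}

# On a real node:
#   boot_probe_verdict "$(systemctl is-system-running)"
#   systemctl --failed    # lists the failed units behind a "degraded" state
```

A single failed unit, such as nvidia-persistenced.service here, is enough to put the whole system in the degraded state.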


Files

tf-mnist-tutorial.py (1.29 KB), Peter Amstutz, 02/03/2022 08:25 PM
tf-mnist-tutorial-gpu.cwl (770 Bytes), Peter Amstutz, 02/03/2022 08:25 PM

Subtasks: 1 (0 open, 1 closed)

Task #18714: review 18713-nvidia-persistenced, Resolved, Peter Amstutz, 02/03/2022


Related issues

Related to Arvados Epics - Story #15957: GPU support, Resolved, 10/01/2021, 03/31/2022

Actions #1

Updated by Ward Vandewege 6 months ago

  • Status changed from New to In Progress
Actions #2

Updated by Ward Vandewege 6 months ago

  • Description updated (diff)
Actions #4

Updated by Ward Vandewege 6 months ago

Actions #6

Updated by Peter Amstutz 6 months ago

It is probably fine to have it disabled, because crunch-run does some GPU driver initialization on its own already.

Actions #7

Updated by Ward Vandewege 6 months ago

I updated the script that builds the compute node image to disable the nvidia-persistenced service in ac52d7ee23b39779712c702945eb9db7e17dd814 on branch 18713-nvidia-persistenced. Ready for review.

I then built a compute image for Tordo from this commit, and that made Tordo work again, cf. https://workbench.tordo.arvadosapi.com/container_requests/tordo-xvhdp-x824fng56ciyvoo
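
For readers without access to the commit, a rough sketch of what such an image-build step could look like. This is not a copy of the change in ac52d7ee23b39779712c702945eb9db7e17dd814; the function name and the `SYSTEMCTL` override are hypothetical, added only so the step can be dry-run.

```shell
# Hypothetical image-build step, not the actual commit contents.
# SYSTEMCTL is an assumed override hook so the command can be dry-run.
disable_persistence_daemon() {
  unit="nvidia-persistenced.service"
  "${SYSTEMCTL:-systemctl}" disable "$unit"
}

# Dry run: prints the arguments instead of calling systemctl.
#   SYSTEMCTL=echo disable_persistence_daemon
```

Note that `systemctl disable` only removes the unit's enablement symlinks. If something else still starts the unit, `systemctl mask` is the stronger option.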

Actions #8

Updated by Peter Amstutz 6 months ago

Ward Vandewege wrote:

I updated the script that builds the compute node image to disable the nvidia-persistenced service in ac52d7ee23b39779712c702945eb9db7e17dd814 on branch 18713-nvidia-persistenced. Ready for review.

I then built a compute image for Tordo from this commit, and that made Tordo work again, cf. https://workbench.tordo.arvadosapi.com/container_requests/tordo-xvhdp-x824fng56ciyvoo

In the comment I would include a note that this doesn't matter, because crunch-run does its own basic CUDA initialization.

We should also confirm that in fact GPUs still work.

Actions #9

Updated by Peter Amstutz 6 months ago

Actions #10

Updated by Peter Amstutz 6 months ago

  • Target version set to 2022-02-16 sprint
  • Assigned To set to Ward Vandewege
Actions #11

Updated by Ward Vandewege 6 months ago

Peter Amstutz wrote:

Ward Vandewege wrote:

I updated the script that builds the compute node image to disable the nvidia-persistenced service in ac52d7ee23b39779712c702945eb9db7e17dd814 on branch 18713-nvidia-persistenced. Ready for review.

I then built a compute image for Tordo from this commit, and that made Tordo work again, cf. https://workbench.tordo.arvadosapi.com/container_requests/tordo-xvhdp-x824fng56ciyvoo

In the comment I would include a note that this doesn't matter, because crunch-run does its own basic CUDA initialization.

Sure, updated in 12c1c51313e897abd0e9d1801b42bc8dc3b8d1d9 on branch 18713-nvidia-persistenced.

We should also confirm that in fact GPUs still work.

Thanks for the sample workflow, it completed at tordo-xvhdp-h7cu2u53dtjf3ag (without reuse!).

Actions #12

Updated by Peter Amstutz 6 months ago

LGTM

Actions #13

Updated by Ward Vandewege 6 months ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|8685251f024c4519c5f61413b9dcb66a86e3efd6.

Actions #14

Updated by Peter Amstutz 5 months ago

  • Release set to 46