Bug #18713

Updated by Ward Vandewege 6 months ago

The systemd unit @nvidia-persistenced.service@ fails to start when a compute image built with NVIDIA GPU support is booted on a non-GPU node: 

 <pre> 
 # systemctl 
 ... 
 ● nvidia-persistenced.service                                                loaded failed failed      NVIDIA Persistence Daemon          
 ... 
 </pre> 

 <pre> 
 # systemctl status nvidia-persistenced.service 
 ● nvidia-persistenced.service - NVIDIA Persistence Daemon 
    Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; disabled; vendor preset: enabled) 
    Active: failed (Result: exit-code) since Thu 2022-02-03 18:38:16 UTC; 16min ago 

 Feb 03 18:38:15 ip-10-253-254-98 systemd[1]: Starting NVIDIA Persistence Daemon... 
 Feb 03 18:38:15 ip-10-253-254-98 nvidia-persistenced[559]: Started (559) 
 Feb 03 18:38:16 ip-10-253-254-98 nvidia-persistenced[559]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 108 has read and write permissions for those files. 
 Feb 03 18:38:16 ip-10-253-254-98 nvidia-persistenced[552]: nvidia-persistenced failed to initialize. Check syslog for more details. 
 Feb 03 18:38:16 ip-10-253-254-98 nvidia-persistenced[559]: Shutdown (559) 
 Feb 03 18:38:16 ip-10-253-254-98 systemd[1]: nvidia-persistenced.service: Control process exited, code=exited, status=1/FAILURE 
 Feb 03 18:38:16 ip-10-253-254-98 systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'. 
 Feb 03 18:38:16 ip-10-253-254-98 systemd[1]: Failed to start NVIDIA Persistence Daemon. 
 </pre> 

 This is a problem because it means that 

   systemctl is-system-running 

 returns @degraded@ and exits with a non-zero status. That command is our default for @BootProbeCommand@, so the boot probe keeps failing and the compute nodes never reach the "ready" state from Arvados' perspective.
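
 To illustrate why a single failed unit blocks the node, here is a minimal sketch of a probe that judges readiness purely by the exit status of the probe command. The helper name @boot_probe@ is hypothetical and this is not Arvados' actual implementation; it only assumes that a non-zero exit status from the probe command is treated as "not ready".

```shell
# Hypothetical helper (not part of Arvados): report readiness based only on
# the exit status of the probe command passed as arguments.
boot_probe() {
    # Run the probe command and discard its output; only the exit status matters.
    if "$@" >/dev/null 2>&1; then
        echo ready
    else
        echo not-ready
    fi
}

# Example invocation on a node (requires systemd, so left commented out here):
#   boot_probe systemctl is-system-running
```

 On an affected non-GPU node, @systemctl is-system-running@ exits non-zero as long as @nvidia-persistenced.service@ is in the failed state, so a probe of this shape would keep reporting the node as not ready.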
