Bug #16600

Compute nodes missing attached scratch space

Added by Peter Amstutz over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Deployment
Target version:
Start date:
Due date:
% Done: 100%
Estimated time:
Story points: -

Description

Compute nodes on lugli are not mounting their scratch space. As a result, computations do not get all the disk space they were promised.

The cause seems to be a command or script missing from the compute node image.

You can see this in the node info log:

https://collections.lugli.arvadosapi.com/c=9ff293ff126c3eaa44b8f654f2e6e7df-823/_/log%20for%20container%20lugli-dz642-jjb7ae2s9z9eal8/node-info.txt

While information about the node is being collected, `df` is run, and its output does not show a data disk mounted.
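A minimal sketch of that check, for anyone who wants to automate it. It assumes the scratch space is expected at `/tmp` (as in the `df` output in comment #9 below) and that it sits on a different device than `/`. The function names `mounted_filesystems` and `has_scratch` are hypothetical and not part of Arvados.

```python
def mounted_filesystems(df_output):
    """Map each mount point found in `df` output text to its device name."""
    mounts = {}
    for line in df_output.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a full six-column
        # row (for example elided "..." lines).
        if len(fields) < 6 or fields[0] == "Filesystem":
            continue
        # The mount point is the last column; rejoin it in case it has spaces.
        mounts[" ".join(fields[5:])] = fields[0]
    return mounts


def has_scratch(df_output, mount_point="/tmp"):
    """Return True if mount_point is mounted from a device other than root's.

    mount_point="/tmp" is an assumption taken from comment #9.
    """
    mounts = mounted_filesystems(df_output)
    return mount_point in mounts and mounts[mount_point] != mounts.get("/")
```

The input text could come from something like `subprocess.run(["df", "-h"], capture_output=True, text=True).stdout` on a compute node.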


Related issues

Related to Arvados - Bug #16611: arvados-docker-cleaner package broken on Debian 10 and Ubuntu 18.04 (Resolved)

History

#1 Updated by Peter Amstutz over 1 year ago

  • Status changed from New to In Progress

#2 Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)

#3 Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)

#4 Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)

#5 Updated by Javier Bértoli over 1 year ago

As Ward suggested, I reused the code, but it fails:

  • The AWS base AMI doesn't have LVM (which we use to manage the ephemeral images)
  • arvados-docker-cleaner fails with this error:
    Jul 16 10:50:34 ip-10-254-254-103 systemd[1]: arvados-docker-cleaner.service: Main process exited, code=exited, status=1/FAILURE
    Jul 16 10:50:34 ip-10-254-254-103 systemd[1]: arvados-docker-cleaner.service: Failed with result 'exit-code'.
    Jul 16 10:50:37 ip-10-254-254-103 dhclient[450]: XMT: Solicit on ens5, interval 121430ms.
    Jul 16 10:50:41 ip-10-254-254-103 sudo[4925]:    admin : TTY=unknown ; PWD=/home/admin ; USER=root ; COMMAND=/usr/bin/docker ps -q
    Jul 16 10:50:41 ip-10-254-254-103 sudo[4925]: pam_unix(sudo:session): session opened for user root by (uid=0)
    Jul 16 10:50:41 ip-10-254-254-103 sudo[4925]: pam_unix(sudo:session): session closed for user root
    Jul 16 10:50:44 ip-10-254-254-103 systemd[1]: arvados-docker-cleaner.service: Service RestartSec=10s expired, scheduling restart.
    Jul 16 10:50:44 ip-10-254-254-103 systemd[1]: arvados-docker-cleaner.service: Scheduled restart job, restart counter is at 47.
    Jul 16 10:50:44 ip-10-254-254-103 systemd[1]: Stopped Arvados Docker Image Cleaner.
    Jul 16 10:50:44 ip-10-254-254-103 systemd[1]: Started Arvados Docker Image Cleaner.
    Jul 16 10:50:44 ip-10-254-254-103 sh[4933]: Traceback (most recent call last):
    Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:   File "/usr/bin/arvados-docker-cleaner", line 5, in <module>
    Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:     from arvados_docker.cleaner import main
    Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:   File "/usr/share/python3/dist/arvados-docker-cleaner/lib/python3.7/site-packages/arvados_docker/cleaner.py", line 21, in <module>
    Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:     import docker
    Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:   File "/usr/share/python3/dist/arvados-docker-cleaner/lib/python3.7/site-packages/docker/__init__.py", line 20, in <module>
    Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:     from .client import Client, AutoVersionClient # flake8: noqa
    Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:   File "/usr/share/python3/dist/arvados-docker-cleaner/lib/python3.7/site-packages/docker/client.py", line 25, in <module>
    Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:     from . import api
    Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:   File "/usr/share/python3/dist/arvados-docker-cleaner/lib/python3.7/site-packages/docker/api/__init__.py", line 2, in <module>
    Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:     from .build import BuildApiMixin
    Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:   File "/usr/share/python3/dist/arvados-docker-cleaner/lib/python3.7/site-packages/docker/api/build.py", line 9, in <module>
    Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:     from .. import utils
    Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:   File "/usr/share/python3/dist/arvados-docker-cleaner/lib/python3.7/site-packages/docker/utils/__init__.py", line 1, in <module>
    Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:     from .utils import (
    Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:   File "/usr/share/python3/dist/arvados-docker-cleaner/lib/python3.7/site-packages/docker/utils/utils.py", line 24, in <module>
    Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:     from distutils.version import StrictVersion
    Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:   File "/usr/share/python3/dist/arvados-docker-cleaner/lib/python3.7/distutils/__init__.py", line 44, in <module>
    Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:     from distutils import dist, sysconfig  # isort:skip
    Jul 16 10:50:44 ip-10-254-254-103 sh[4933]: ImportError: cannot import name 'dist' from 'distutils' (/usr/share/python3/dist/arvados-docker-cleaner/lib/python3.7/distutils/__init__.py)
    Jul 16 10:50:44 ip-10-254-254-103 systemd[1]: arvados-docker-cleaner.service: Main process exited, code=exited, status=1/FAILURE
    Jul 16 10:50:44 ip-10-254-254-103 systemd[1]: arvados-docker-cleaner.service: Failed with result 'exit-code'.
    
  • docker does not start

#6 Updated by Javier Bértoli over 1 year ago

  • Related to Bug #16611: arvados-docker-cleaner package broken on Debian 10 and Ubuntu 18.04 added

#7 Updated by Javier Bértoli over 1 year ago

After I refactored the image, it didn't show a scratch space (perhaps my fault), but while discussing this with Lucas & Nico I found the following:

  • pirca/lugli (iirc, Peter added the node info in these clusters):
            Scratch: 200GB
            AddedScratch: 200GB
    
  • su92l:
            Scratch: 100000000000
            IncludedScratch: 100000000000
    

so, which is the correct format to use?

#8 Updated by Tom Clegg over 1 year ago

This is preferred for current versions of Arvados:

AddedScratch: 0                # portion added separately
IncludedScratch: 100000000000  # portion included with node type

AddedScratch: 100000000000     # portion added separately
IncludedScratch: 0             # portion included with node type

This has the same effect as the first example, and is useful if you're trying to make the same config file work with an older version of Arvados that doesn't pay attention to IncludedScratch/AddedScratch:

Scratch: 100000000000          # total
IncludedScratch: 100000000000 # portion included with node type
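To illustrate how the three keys relate, here is a rough sketch that assumes the total is simply the included portion plus the added portion, as the comments above suggest. It is not the actual Arvados config loader. The function name `normalize_scratch` is hypothetical, and values are assumed to be plain byte counts (strings such as `200GB` are not handled).

```python
def normalize_scratch(entry):
    """Return (total, included, added) scratch bytes for one instance type.

    Assumption: Scratch == IncludedScratch + AddedScratch. This mirrors the
    comments in this thread, not Arvados's real loading code.
    """
    total = entry.get("Scratch")
    included = entry.get("IncludedScratch")
    added = entry.get("AddedScratch")

    if total is None:
        # Preferred form: only the two portions are given.
        included = included or 0
        added = added or 0
        return included + added, included, added

    if included is None and added is None:
        # This sketch refuses to guess which portion a bare total refers to.
        raise ValueError("Scratch given without IncludedScratch or AddedScratch")

    # Compatibility form: derive whichever portion is missing from the total.
    if included is None:
        included = total - added
    if added is None:
        added = total - included

    if included < 0 or added < 0 or included + added != total:
        raise ValueError("Scratch must equal IncludedScratch + AddedScratch")
    return total, included, added
```

Under that assumption, `{"AddedScratch": 0, "IncludedScratch": 100000000000}` and `{"Scratch": 100000000000, "IncludedScratch": 100000000000}` normalize to the same tuple.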

#9 Updated by Javier Bértoli over 1 year ago

  • % Done changed from 0 to 100
  • Assigned To changed from Javier Bértoli to Peter Amstutz
  • Status changed from In Progress to Feedback
  • Category set to Deployment

This should be fixed in commit 1d2304b@packer (branch compute-image-simplified-script), pushed via commit 9c740af@saltstack.

On a running compute image:

root@ip-10-255-254-215:~# df -h
Filesystem       Size  Used Avail Use% Mounted on
...
/dev/nvme1n1p1   7.7G  1.6G  5.8G  22% /
...
/dev/mapper/tmp   47G   81M   47G   1% /tmp

I'm testing with this job and waiting for it to finish to confirm that all is OK before closing this issue.

#11 Updated by Peter Amstutz over 1 year ago

  • Status changed from Feedback to Resolved

This is working. The job failed because it ran out of RAM, but I can see from the stats that it used 10 GB of disk when the root disk only has 5 GB available, so it was clearly using the scratch space.
