Bug #16600 (closed): Compute nodes missing attached scratch space
Description
Compute nodes on lugli are not mounting their scratch space; as a result, computations do not get all the disk space they were promised.
This seems to be a missing command or script in the compute node image.
You can see here
https://collections.lugli.arvadosapi.com/c=9ff293ff126c3eaa44b8f654f2e6e7df-823/_/log%20for%20container%20lugli-dz642-jjb7ae2s9z9eal8/node-info.txt
When collecting information about the node, it runs `df`, which does not show a data disk mounted.
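For reference, a quick manual check on a compute node (hypothetical diagnostic commands, not something node-info.txt runs) would look like:

# List the block devices and the filesystem backing /tmp; the scratch volume
# should show up as its own device and mount point rather than being part of
# the root filesystem.
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
df -h /tmp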
Updated by Peter Amstutz over 4 years ago
- Status changed from New to In Progress
Updated by Javier Bértoli over 4 years ago
As Ward suggested, I reused the code, but it fails:
- The AWS base AMI doesn't have LVM (which we use to manage the ephemeral disks); see the sketch after this list.
- arvados-docker-cleaner fails with this error:

Jul 16 10:50:34 ip-10-254-254-103 systemd[1]: arvados-docker-cleaner.service: Main process exited, code=exited, status=1/FAILURE
Jul 16 10:50:34 ip-10-254-254-103 systemd[1]: arvados-docker-cleaner.service: Failed with result 'exit-code'.
Jul 16 10:50:37 ip-10-254-254-103 dhclient[450]: XMT: Solicit on ens5, interval 121430ms.
Jul 16 10:50:41 ip-10-254-254-103 sudo[4925]: admin : TTY=unknown ; PWD=/home/admin ; USER=root ; COMMAND=/usr/bin/docker ps -q
Jul 16 10:50:41 ip-10-254-254-103 sudo[4925]: pam_unix(sudo:session): session opened for user root by (uid=0)
Jul 16 10:50:41 ip-10-254-254-103 sudo[4925]: pam_unix(sudo:session): session closed for user root
Jul 16 10:50:44 ip-10-254-254-103 systemd[1]: arvados-docker-cleaner.service: Service RestartSec=10s expired, scheduling restart.
Jul 16 10:50:44 ip-10-254-254-103 systemd[1]: arvados-docker-cleaner.service: Scheduled restart job, restart counter is at 47.
Jul 16 10:50:44 ip-10-254-254-103 systemd[1]: Stopped Arvados Docker Image Cleaner.
Jul 16 10:50:44 ip-10-254-254-103 systemd[1]: Started Arvados Docker Image Cleaner.
Jul 16 10:50:44 ip-10-254-254-103 sh[4933]: Traceback (most recent call last):
Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:   File "/usr/bin/arvados-docker-cleaner", line 5, in <module>
Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:     from arvados_docker.cleaner import main
Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:   File "/usr/share/python3/dist/arvados-docker-cleaner/lib/python3.7/site-packages/arvados_docker/cleaner.py", line 21, in <module>
Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:     import docker
Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:   File "/usr/share/python3/dist/arvados-docker-cleaner/lib/python3.7/site-packages/docker/__init__.py", line 20, in <module>
Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:     from .client import Client, AutoVersionClient # flake8: noqa
Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:   File "/usr/share/python3/dist/arvados-docker-cleaner/lib/python3.7/site-packages/docker/client.py", line 25, in <module>
Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:     from . import api
Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:   File "/usr/share/python3/dist/arvados-docker-cleaner/lib/python3.7/site-packages/docker/api/__init__.py", line 2, in <module>
Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:     from .build import BuildApiMixin
Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:   File "/usr/share/python3/dist/arvados-docker-cleaner/lib/python3.7/site-packages/docker/api/build.py", line 9, in <module>
Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:     from .. import utils
Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:   File "/usr/share/python3/dist/arvados-docker-cleaner/lib/python3.7/site-packages/docker/utils/__init__.py", line 1, in <module>
Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:     from .utils import (
Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:   File "/usr/share/python3/dist/arvados-docker-cleaner/lib/python3.7/site-packages/docker/utils/utils.py", line 24, in <module>
Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:     from distutils.version import StrictVersion
Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:   File "/usr/share/python3/dist/arvados-docker-cleaner/lib/python3.7/distutils/__init__.py", line 44, in <module>
Jul 16 10:50:44 ip-10-254-254-103 sh[4933]:     from distutils import dist, sysconfig # isort:skip
Jul 16 10:50:44 ip-10-254-254-103 sh[4933]: ImportError: cannot import name 'dist' from 'distutils' (/usr/share/python3/dist/arvados-docker-cleaner/lib/python3.7/distutils/__init__.py)
Jul 16 10:50:44 ip-10-254-254-103 systemd[1]: arvados-docker-cleaner.service: Main process exited, code=exited, status=1/FAILURE
Jul 16 10:50:44 ip-10-254-254-103 systemd[1]: arvados-docker-cleaner.service: Failed with result 'exit-code'.
- docker does not start
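For context, a minimal sketch of the kind of LVM setup the image needs, assuming a single ephemeral/EBS data disk; the device name (/dev/nvme1n1) and the volume-group/logical-volume names are illustrative, not the actual packer/salt code:

# Install LVM (missing from the base AMI), put the data disk under it, and
# mount the resulting logical volume as scratch space.
apt-get install -y lvm2
pvcreate /dev/nvme1n1
vgcreate scratch /dev/nvme1n1
lvcreate -l 100%FREE -n tmp scratch
mkfs.ext4 /dev/scratch/tmp
mount /dev/scratch/tmp /tmp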
Updated by Javier Bértoli over 4 years ago
- Related to Bug #16611: arvados-docker-cleaner package broken on Debian 10 and Ubuntu 18.04 added
Updated by Javier Bértoli over 4 years ago
After refactoring the image, it still didn't show any scratch space (perhaps my fault), but while discussing this with Lucas & Nico I found these different configurations:
- Documentation reference:
  IncludedScratch: 16GB
  AddedScratch: 0
- pirca/lugli (iirc, Peter added the node info in these clusters):
  Scratch: 200GB
  AddedScratch: 200GB
- su92l:
  Scratch: 100000000000
  IncludedScratch: 100000000000
so, which is the correct format to use?
Updated by Tom Clegg over 4 years ago
This is preferred for current versions of Arvados:

AddedScratch: 0                   # portion added separately
IncludedScratch: 100000000000     # portion included with node type

or:

AddedScratch: 100000000000        # portion added separately
IncludedScratch: 0                # portion included with node type

This has the same effect as the first example, and is useful if you're trying to make the same config file work with an older version of Arvados that doesn't pay attention to IncludedScratch/AddedScratch:

Scratch: 100000000000             # total
IncludedScratch: 100000000000     # portion included with node type
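(For scale, 100000000000 bytes is roughly 93 GiB.) A quick, purely illustrative way to compare a configured value against what a node actually has mounted as scratch (assuming the scratch filesystem is the one backing /tmp):

# Print the size of the filesystem backing /tmp in bytes, for comparison with
# the Scratch/IncludedScratch/AddedScratch values in the cluster config.
df --block-size=1 --output=size,target /tmp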
Updated by Javier Bértoli over 4 years ago
- % Done changed from 0 to 100
- Assigned To changed from Javier Bértoli to Peter Amstutz
- Status changed from In Progress to Feedback
- Category set to Deployment
Should be fixed in commit 1d2304b@packer (branch compute-image-simplified-script) and pushed via commit 9c740af@saltstack.
On a running compute image:
root@ip-10-255-254-215:~# df -h
Filesystem       Size  Used Avail Use% Mounted on
...
/dev/nvme1n1p1   7.7G  1.6G  5.8G  22% /
...
/dev/mapper/tmp   47G   81M   47G   1% /tmp
Testing with this job; waiting for it to finish to see if all is OK before closing this issue.
Updated by Peter Amstutz over 4 years ago
- Status changed from Feedback to Resolved
This is working. The job failed because it ran out of RAM, but I can see from the stats that it used 10 GB of disk when the root disk only has 5 GB available, so clearly it was using the scratch space.