Bug #17816
Closed
singularity not setting working directory
Description
As part of testing in #17755, it appears that /tmp (?) is not writable in our singularity containers. a-c-r relies on /tmp being writable, and many other tools likely do too. It seems a bit unreasonable for /tmp to be unwritable, if that is really the problem. Maybe we can change something about the way we are invoking singularity?
See e.g. ce8i5-xvhdp-1ubjhuo87i24ora:
2021-06-16T20:15:21.003947318Z INFO /usr/bin/arvados-cwl-runner 2.3.0.dev20210610215458, arvados-python-client 2.3.0.dev20210610215458, cwltool 3.0.20210319143721
2021-06-16T20:15:21.022483370Z INFO Resolved '/var/lib/cwl/workflow.json#main' to 'file:///var/lib/cwl/workflow.json#main'
2021-06-16T20:15:23.231706231Z INFO Using cluster ce8i5 (https://workbench2.ce8i5.arvadosapi.com/)
2021-06-16T20:15:30.386974208Z INFO Using collection cache size 256 MiB
2021-06-16T20:15:30.448702283Z INFO Running inside container ce8i5-dz642-9asvb2g41z514n8
2021-06-16T20:15:30.665440997Z INFO [workflow workflow.json#main] start
2021-06-16T20:15:30.666220299Z INFO [workflow workflow.json#main] starting step substep
2021-06-16T20:15:30.666896701Z INFO [step substep] start
2021-06-16T20:15:31.473008286Z WARNING X-Keep-Storage-Classes header not supported by the cluster
2021-06-16T20:15:31.700830931Z INFO Using collection ce8i5-4zz18-df50zijeqkpbdaf
2021-06-16T20:15:36.635135414Z ERROR Unexpected exception
2021-06-16T20:15:36.635135414Z Traceback (most recent call last):
2021-06-16T20:15:36.635135414Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow.py", line 436, in job
2021-06-16T20:15:36.635135414Z     runtimeContext,
2021-06-16T20:15:36.635135414Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/command_line_tool.py", line 964, in job
2021-06-16T20:15:36.635135414Z     j.stagedir = runtimeContext.create_tmpdir()
2021-06-16T20:15:36.635135414Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/context.py", line 181, in create_tmpdir
2021-06-16T20:15:36.635135414Z     return tempfile.mkdtemp(prefix=tmp_prefix, dir=tmp_dir)
2021-06-16T20:15:36.635135414Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/tempfile.py", line 505, in mkdtemp
2021-06-16T20:15:36.635135414Z     _os.mkdir(file, 0o700)
2021-06-16T20:15:36.635135414Z OSError: [Errno 30] Read-only file system: 'tmpph0scv23'
2021-06-16T20:15:36.831953372Z ERROR Cannot make scatter job: [Errno 30] Read-only file system: 'tmpph0scv23'
2021-06-16T20:15:37.120042388Z INFO [step substep] start
2021-06-16T20:15:37.240278629Z ERROR Unexpected exception
2021-06-16T20:15:37.240278629Z Traceback (most recent call last):
2021-06-16T20:15:37.240278629Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow.py", line 436, in job
2021-06-16T20:15:37.240278629Z     runtimeContext,
2021-06-16T20:15:37.240278629Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/command_line_tool.py", line 964, in job
2021-06-16T20:15:37.240278629Z     j.stagedir = runtimeContext.create_tmpdir()
2021-06-16T20:15:37.240278629Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/context.py", line 181, in create_tmpdir
2021-06-16T20:15:37.240278629Z     return tempfile.mkdtemp(prefix=tmp_prefix, dir=tmp_dir)
2021-06-16T20:15:37.240278629Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/tempfile.py", line 505, in mkdtemp
2021-06-16T20:15:37.240278629Z     _os.mkdir(file, 0o700)
2021-06-16T20:15:37.240278629Z OSError: [Errno 30] Read-only file system: 'tmpi4ayzhq8'
2021-06-16T20:15:37.383785035Z ERROR Cannot make scatter job: [Errno 30] Read-only file system: 'tmpi4ayzhq8'
2021-06-16T20:15:37.546091895Z INFO [step substep] start
2021-06-16T20:15:37.660351019Z ERROR Unexpected exception
2021-06-16T20:15:37.660351019Z Traceback (most recent call last):
2021-06-16T20:15:37.660351019Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow.py", line 436, in job
2021-06-16T20:15:37.660351019Z     runtimeContext,
2021-06-16T20:15:37.660351019Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/command_line_tool.py", line 964, in job
2021-06-16T20:15:37.660351019Z     j.stagedir = runtimeContext.create_tmpdir()
2021-06-16T20:15:37.660351019Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/context.py", line 181, in create_tmpdir
2021-06-16T20:15:37.660351019Z     return tempfile.mkdtemp(prefix=tmp_prefix, dir=tmp_dir)
2021-06-16T20:15:37.660351019Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/tempfile.py", line 505, in mkdtemp
2021-06-16T20:15:37.660351019Z     _os.mkdir(file, 0o700)
2021-06-16T20:15:37.660351019Z OSError: [Errno 30] Read-only file system: 'tmpybc3dv2f'
2021-06-16T20:15:37.867057605Z ERROR Cannot make scatter job: [Errno 30] Read-only file system: 'tmpybc3dv2f'
2021-06-16T20:15:37.985364440Z INFO [step substep] start
2021-06-16T20:15:38.084313721Z ERROR Unexpected exception
2021-06-16T20:15:38.084313721Z Traceback (most recent call last):
2021-06-16T20:15:38.084313721Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow.py", line 436, in job
2021-06-16T20:15:38.084313721Z     runtimeContext,
2021-06-16T20:15:38.084313721Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/command_line_tool.py", line 964, in job
2021-06-16T20:15:38.084313721Z     j.stagedir = runtimeContext.create_tmpdir()
2021-06-16T20:15:38.084313721Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/context.py", line 181, in create_tmpdir
2021-06-16T20:15:38.084313721Z     return tempfile.mkdtemp(prefix=tmp_prefix, dir=tmp_dir)
2021-06-16T20:15:38.084313721Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/tempfile.py", line 505, in mkdtemp
2021-06-16T20:15:38.084313721Z     _os.mkdir(file, 0o700)
2021-06-16T20:15:38.084313721Z OSError: [Errno 30] Read-only file system: 'tmp8_atoptz'
2021-06-16T20:15:38.254878304Z ERROR Cannot make scatter job: [Errno 30] Read-only file system: 'tmp8_atoptz'
2021-06-16T20:15:38.424077683Z WARNING [step substep] completed permanentFail
2021-06-16T20:15:38.585621241Z INFO [workflow workflow.json#main] completed permanentFail
2021-06-16T20:15:38.585821742Z ERROR Overall process status is permanentFail
Updated by Ward Vandewege over 3 years ago
- Description updated (diff)
- Subject changed from [singularity] /tmp (?) is not writeable to [singularity] error: read-only file system
Updated by Ward Vandewege over 3 years ago
- Related to Idea #17755: Test singularity support on a cloud cluster by running some real workflows added
Updated by Ward Vandewege over 3 years ago
- Related to deleted (Idea #17755: Test singularity support on a cloud cluster by running some real workflows)
Updated by Ward Vandewege over 3 years ago
- Blocked by Idea #17755: Test singularity support on a cloud cluster by running some real workflows added
Updated by Ward Vandewege over 3 years ago
- Blocked by deleted (Idea #17755: Test singularity support on a cloud cluster by running some real workflows)
Updated by Ward Vandewege over 3 years ago
- Blocks Idea #17755: Test singularity support on a cloud cluster by running some real workflows added
Updated by Tom Clegg over 3 years ago
Proposed solution: When creating a container request, api/controller has a (configurable?) list of mount points like /tmp and /var/tmp that automatically get added as mounts (as if they had been specified with {"kind":"tmp","capacity":10000000,"device_type":"disk"}) if the container request spec does not mount anything at or above that point.
The current implementation of crunch-run ignores the "capacity" argument (it's used elsewhere for choosing a node type, but crunch-run doesn't try to limit usage at runtime) so the arbitrary size 10000000 doesn't really matter.
Arguably, the requester really should be specifying these mounts explicitly instead of expecting the entire filesystem to be writable -- but configurable automatic/implicit mounts should make the migration much easier, with the option to turn it off once all clients/workflows are updated with explicit mounts.
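A minimal sketch of the "at or above that point" rule from this proposal, in Python for illustration only. The function names (covered, add_implicit_tmp_mounts) and the default list of mount points are hypothetical; this is not api/controller code, and it assumes mounts is a dict keyed by mount target, with non-path keys such as "stdout" ignored.

```python
import posixpath

# Mount spec taken from the proposal above; capacity is arbitrary there too.
TMP_MOUNT = {"kind": "tmp", "capacity": 10000000, "device_type": "disk"}


def covered(point, mounts):
    """Return True if something is mounted at `point` or at one of its ancestors."""
    # Only absolute paths are mount targets; keys like "stdout" are skipped.
    targets = {posixpath.normpath(t) for t in mounts if t.startswith("/")}
    path = posixpath.normpath(point)
    while True:
        if path in targets:
            return True
        if path == "/":
            return False
        path = posixpath.dirname(path)


def add_implicit_tmp_mounts(mounts, implicit_points=("/tmp", "/var/tmp")):
    """Return a copy of `mounts` with tmp mounts added where nothing covers them."""
    result = dict(mounts)
    for point in implicit_points:
        # Check against the requester's own mounts, not the ones added here.
        if not covered(point, mounts):
            result[point] = dict(TMP_MOUNT)
    return result
```

With this rule, a request that already mounts /var would get an implicit /tmp but no implicit /var/tmp, because /var sits above /var/tmp.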
Updated by Tom Clegg over 3 years ago
- Related to Idea #16305: Singularity support added
Updated by Peter Amstutz over 3 years ago
- Target version deleted (To Be Groomed)
Updated by Ward Vandewege over 3 years ago
- Subject changed from [singularity] error: read-only file system to [singularity] a-c-r should add /tmp to its job description, or not use it
Updated by Ward Vandewege over 3 years ago
- Target version set to 2021-07-21 sprint
Updated by Ward Vandewege over 3 years ago
- Subject changed from [singularity] a-c-r should add /tmp to its job description, or not use it to [a-c-r] should add /tmp to its job description, or not use it (affects singularity which has a read-only container filesystem)
Updated by Peter Amstutz over 3 years ago
17816-crunch-dispatch-singularity
Adds --runtime-engine to the crunch-run invocation of crunch-dispatch-local and crunch-dispatch-slurm.
Updated by Lucas Di Pentima over 3 years ago
Some comments & questions:
- If we're starting to use the config loader on crunch-dispatch-local, do you think it would be convenient to create the arvados client (line 89) from the config data instead of an env var? That way we could avoid potentially difficult-to-debug problems where a dispatcher runs against one cluster but uses the runtime engine of another.
- Related to the above comment: I think a migration note should be added if c-d-l depends on the config file being present.
- I'm assuming that we don't have any integration tests for singularity + c-d-l, does it make sense to add some?
Updated by Peter Amstutz over 3 years ago
- Subject changed from [a-c-r] should add /tmp to its job description, or not use it (affects singularity which has a read-only container filesystem) to singularity not setting working directory
Updated by Peter Amstutz over 3 years ago
As it turns out, a-c-r isn't trying to create temp directories in /tmp; they get created in the current directory.
This revealed the actual problem: crunch-run was not setting the working directory when using singularity, so the program was being started in /root (not writable) instead of /var/spool/cwl (writable).
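The relative name in the error ('tmpph0scv23', with no leading directory) is consistent with tempfile.mkdtemp being handed a relative dir, in which case the directory lands in whatever the process's working directory happens to be. A small sketch of that behaviour follows; dir="" is an assumption about the value that reached mkdtemp, not something confirmed in this thread, and make_relative_tmpdir is a hypothetical helper.

```python
import os
import tempfile


def make_relative_tmpdir(workdir):
    """Create a temp dir with a relative `dir` while the cwd is `workdir`.

    Returns the resolved absolute path of the new directory.
    """
    previous = os.getcwd()
    os.chdir(workdir)
    try:
        # Assumption: an empty (relative) dir, so the path is resolved
        # against the current working directory rather than /tmp.
        created = tempfile.mkdtemp(prefix="tmp", dir="")
        return os.path.realpath(created)
    finally:
        os.chdir(previous)
```

If the working directory is on a read-only filesystem, the same call fails in os.mkdir with Errno 30, which matches the traceback in the description.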
Updated by Peter Amstutz over 3 years ago
Lucas Di Pentima wrote:
> Some comments & questions:
> - If we're starting to use the config loader on crunch-dispatch-local, do you think it would be convenient to create the arvados client (line 89) from the config data instead of an env var? That way we could avoid potentially difficult-to-debug problems where a dispatcher runs against one cluster but uses the runtime engine of another.
> - Related to the above comment: I think a migration note should be added if c-d-l depends on the config file being present.
Uses the cluster config now. Added upgrade note.
> - I'm assuming that we don't have any integration tests for singularity + c-d-l, does it make sense to add some?
I don't think so, it isn't like our unit test framework could have helped discover that the feature didn't exist, it required the kind of system testing that I was already doing.
17816-crunch-dispatch-singularity @ f2ee5bac37391ce9fe084306da332becd7620ca7
In addition, I fixed the actual original bug, which was the "read only file system" error. I also discovered an apparent discrepancy between the comment on MarshalManifest (that it is supposed to flush before getting the manifest text) and the actual behavior (it doesn't) -- fixed by adding a call to Flush().
17816-singularity-cwd @ eec5086af5c2d1c1f17bbc525cc68d394c9680f4
Updated by Peter Amstutz over 3 years ago
Ran the entire CWL test suite. Getting a semi-random failure:
container creation failed: mount /tmp/crunch-run.x2z00-dz642-38q11t2a8r1d3tv.087008439/keep438609578/by_id/04f89c0db086d2496544715d9ddc4875+72/renamed-filelist.txt->/var/spool/cwl/renamed-filelist.txt error: while mounting /tmp/crunch-run.x2z00-dz642-38q11t2a8r1d3tv.087008439/keep438609578/by_id/04f89c0db086d2496544715d9ddc4875+72/renamed-filelist.txt: destination /var/spool/cwl/renamed-filelist.txt doesn't exist in container
/var/spool/cwl is the output directory and /var/spool/cwl/renamed-filelist.txt is a file that's supposed to be staged in the output directory.
It doesn't fail every time so I suspect that we need to sort the bind mounts to ensure that /var/spool/cwl is included on the command line before /var/spool/cwl/renamed-filelist.txt.
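The ordering idea can be sketched as follows, in Python for illustration (the helper name order_bind_mounts is hypothetical, and this is not the crunch-run code). It relies on one property of string ordering: a path that is a prefix of another always sorts first, so sorting by target path puts every parent directory ahead of anything nested inside it.

```python
def order_bind_mounts(binds):
    """Return (target, source) pairs ordered so parents precede nested targets.

    `binds` maps a mount target inside the container to its source on the host.
    """
    # A string that is a prefix of another compares as smaller, so
    # "/var/spool/cwl" always sorts before "/var/spool/cwl/<anything>".
    return [(target, binds[target]) for target in sorted(binds)]
```

A deterministic order would also explain why the failure stops being semi-random: the nested file can no longer be bound before the directory that is supposed to contain it.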
Updated by Tom Clegg over 3 years ago
17816-stdout-stderr-race @ 4879256386a5be9566f31f2c266b682993029e14 -- developer-run-tests: #2585
Updated by Tom Clegg over 3 years ago
- commit eec5086af5c2d1c1f17bbc525cc68d394c9680f4 looks good
- I dropped the commit with the unrelated Flush() and Close() additions -- I think 17816-stdout-stderr-race is a better solution for that
- I added a test
- now it's 17816-singularity-cwd @ d08083912c64b429e4ec06b9a42edd001c1e52a6 -- developer-run-tests: #2586
Updated by Peter Amstutz over 3 years ago
17816-singularity-cwd @ 9a1c67deabd249e068284bb86f148d4aa9998711
- Rebased onto 17816-stdout-stderr-race
- Includes the --runtime-engine fixes
- Includes 13b4d219384a81141846588b20f07792d64cb489 (adds --pwd)
- Includes your test for pwd (696f75b0c857a01b31205411cdef7a20fe7b93fe)
- Includes another fix to sort the bind mounts command line for when mounts are nested
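For readers who want to see how the --pwd and bind-ordering fixes fit together, here is a rough sketch of assembling a singularity exec argument list. It is written in Python for illustration and is not the crunch-run implementation; singularity_argv is a hypothetical helper, and the image path and command in any call are placeholders. It uses only singularity's --bind (source:destination) and --pwd options.

```python
def singularity_argv(image, workdir, binds, command):
    """Build an argv for `singularity exec` with ordered binds and a working dir.

    `binds` maps container targets to host sources; `command` is a list.
    """
    argv = ["singularity", "exec"]
    # Sorted targets: a parent directory is bound before files nested in it.
    for target in sorted(binds):
        argv += ["--bind", "{}:{}".format(binds[target], target)]
    # Without --pwd the process may start somewhere unwritable such as /root.
    argv += ["--pwd", workdir, image]
    argv += list(command)
    return argv
```

Calling it with workdir set to /var/spool/cwl yields a command line where the output directory is bound first and the process starts inside it.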