Bug #17816

singularity not setting working directory

Added by Ward Vandewege 4 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 07/14/2021
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -

Description

As part of testing in #17755, it appears that /tmp (?) is not writable in our singularity containers. a-c-r relies on /tmp being writable, and many other tools likely do too. It seems a bit unreasonable for /tmp to be unwritable, if that is really the problem. Maybe we can change something about the way we are invoking singularity?

See e.g. ce8i5-xvhdp-1ubjhuo87i24ora:

2021-06-16T20:15:21.003947318Z INFO /usr/bin/arvados-cwl-runner 2.3.0.dev20210610215458, arvados-python-client 2.3.0.dev20210610215458, cwltool 3.0.20210319143721
2021-06-16T20:15:21.022483370Z INFO Resolved '/var/lib/cwl/workflow.json#main' to 'file:///var/lib/cwl/workflow.json#main'
2021-06-16T20:15:23.231706231Z INFO Using cluster ce8i5 (https://workbench2.ce8i5.arvadosapi.com/)
2021-06-16T20:15:30.386974208Z INFO Using collection cache size 256 MiB
2021-06-16T20:15:30.448702283Z INFO Running inside container ce8i5-dz642-9asvb2g41z514n8
2021-06-16T20:15:30.665440997Z INFO [workflow workflow.json#main] start
2021-06-16T20:15:30.666220299Z INFO [workflow workflow.json#main] starting step substep
2021-06-16T20:15:30.666896701Z INFO [step substep] start
2021-06-16T20:15:31.473008286Z WARNING X-Keep-Storage-Classes header not supported by the cluster
2021-06-16T20:15:31.700830931Z INFO Using collection ce8i5-4zz18-df50zijeqkpbdaf
2021-06-16T20:15:36.635135414Z ERROR Unexpected exception
2021-06-16T20:15:36.635135414Z Traceback (most recent call last):
2021-06-16T20:15:36.635135414Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow.py", line 436, in job
2021-06-16T20:15:36.635135414Z     runtimeContext,
2021-06-16T20:15:36.635135414Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/command_line_tool.py", line 964, in job
2021-06-16T20:15:36.635135414Z     j.stagedir = runtimeContext.create_tmpdir()
2021-06-16T20:15:36.635135414Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/context.py", line 181, in create_tmpdir
2021-06-16T20:15:36.635135414Z     return tempfile.mkdtemp(prefix=tmp_prefix, dir=tmp_dir)
2021-06-16T20:15:36.635135414Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/tempfile.py", line 505, in mkdtemp
2021-06-16T20:15:36.635135414Z     _os.mkdir(file, 0o700)
2021-06-16T20:15:36.635135414Z OSError: [Errno 30] Read-only file system: 'tmpph0scv23'
2021-06-16T20:15:36.831953372Z ERROR Cannot make scatter job: [Errno 30] Read-only file system: 'tmpph0scv23'
2021-06-16T20:15:37.120042388Z INFO [step substep] start
2021-06-16T20:15:37.240278629Z ERROR Unexpected exception
2021-06-16T20:15:37.240278629Z Traceback (most recent call last):
2021-06-16T20:15:37.240278629Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow.py", line 436, in job
2021-06-16T20:15:37.240278629Z     runtimeContext,
2021-06-16T20:15:37.240278629Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/command_line_tool.py", line 964, in job
2021-06-16T20:15:37.240278629Z     j.stagedir = runtimeContext.create_tmpdir()
2021-06-16T20:15:37.240278629Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/context.py", line 181, in create_tmpdir
2021-06-16T20:15:37.240278629Z     return tempfile.mkdtemp(prefix=tmp_prefix, dir=tmp_dir)
2021-06-16T20:15:37.240278629Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/tempfile.py", line 505, in mkdtemp
2021-06-16T20:15:37.240278629Z     _os.mkdir(file, 0o700)
2021-06-16T20:15:37.240278629Z OSError: [Errno 30] Read-only file system: 'tmpi4ayzhq8'
2021-06-16T20:15:37.383785035Z ERROR Cannot make scatter job: [Errno 30] Read-only file system: 'tmpi4ayzhq8'
2021-06-16T20:15:37.546091895Z INFO [step substep] start
2021-06-16T20:15:37.660351019Z ERROR Unexpected exception
2021-06-16T20:15:37.660351019Z Traceback (most recent call last):
2021-06-16T20:15:37.660351019Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow.py", line 436, in job
2021-06-16T20:15:37.660351019Z     runtimeContext,
2021-06-16T20:15:37.660351019Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/command_line_tool.py", line 964, in job
2021-06-16T20:15:37.660351019Z     j.stagedir = runtimeContext.create_tmpdir()
2021-06-16T20:15:37.660351019Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/context.py", line 181, in create_tmpdir
2021-06-16T20:15:37.660351019Z     return tempfile.mkdtemp(prefix=tmp_prefix, dir=tmp_dir)
2021-06-16T20:15:37.660351019Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/tempfile.py", line 505, in mkdtemp
2021-06-16T20:15:37.660351019Z     _os.mkdir(file, 0o700)
2021-06-16T20:15:37.660351019Z OSError: [Errno 30] Read-only file system: 'tmpybc3dv2f'
2021-06-16T20:15:37.867057605Z ERROR Cannot make scatter job: [Errno 30] Read-only file system: 'tmpybc3dv2f'
2021-06-16T20:15:37.985364440Z INFO [step substep] start
2021-06-16T20:15:38.084313721Z ERROR Unexpected exception
2021-06-16T20:15:38.084313721Z Traceback (most recent call last):
2021-06-16T20:15:38.084313721Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow.py", line 436, in job
2021-06-16T20:15:38.084313721Z     runtimeContext,
2021-06-16T20:15:38.084313721Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/command_line_tool.py", line 964, in job
2021-06-16T20:15:38.084313721Z     j.stagedir = runtimeContext.create_tmpdir()
2021-06-16T20:15:38.084313721Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/context.py", line 181, in create_tmpdir
2021-06-16T20:15:38.084313721Z     return tempfile.mkdtemp(prefix=tmp_prefix, dir=tmp_dir)
2021-06-16T20:15:38.084313721Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/tempfile.py", line 505, in mkdtemp
2021-06-16T20:15:38.084313721Z     _os.mkdir(file, 0o700)
2021-06-16T20:15:38.084313721Z OSError: [Errno 30] Read-only file system: 'tmp8_atoptz'
2021-06-16T20:15:38.254878304Z ERROR Cannot make scatter job: [Errno 30] Read-only file system: 'tmp8_atoptz'
2021-06-16T20:15:38.424077683Z WARNING [step substep] completed permanentFail
2021-06-16T20:15:38.585621241Z INFO [workflow workflow.json#main] completed permanentFail
2021-06-16T20:15:38.585821742Z ERROR Overall process status is permanentFail

Subtasks

Task #17871: Review 17816-singularity-cwd (Resolved, Tom Clegg)

Task #17907: fix (Resolved, Peter Amstutz)

Task #17915: Review 17816-crunch-dispatch-singularity (Resolved, Tom Clegg)


Related issues

Related to Arvados Epics - Story #16305: Singularity support (Resolved, 01/01/2021, 09/30/2021)

Blocks Arvados - Story #17755: Test singularity support on a cloud cluster by running some real workflows (Resolved, 09/03/2021)

Associated revisions

Revision 237f9a7c (diff)
Added by Peter Amstutz 4 months ago

17816: Add --runtime-engine to crunch-dispatch-local and crunch-dispatch-slurm

refs #17816

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

Revision 570af13f (diff)
Added by Peter Amstutz 3 months ago

17816: Add --runtime-engine to crunch-dispatch-local and crunch-dispatch-slurm

refs #17816

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

Revision 8adcf378
Added by Peter Amstutz 3 months ago

Merge branch '17816-singularity-cwd' into main refs #17816

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

Revision 64813b8e (diff)
Added by Ward Vandewege 3 months ago

Fix arvbox demo image build.

refs #17816

Arvados-DCO-1.1-Signed-off-by: Ward Vandewege <>

Revision b398ebef (diff)
Added by Peter Amstutz 3 months ago

17816: Add --runtime-engine to crunch-dispatch-local and crunch-dispatch-slurm

refs #17816

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

Revision b13e32f7 (diff)
Added by Ward Vandewege 3 months ago

Fix arvbox demo image build.

refs #17816

Arvados-DCO-1.1-Signed-off-by: Ward Vandewege <>

History

#1 Updated by Ward Vandewege 4 months ago

  • Description updated (diff)
  • Subject changed from [singularity] /tmp (?) is not writeable to [singularity] error: read-only file system

#2 Updated by Ward Vandewege 4 months ago

  • Related to Story #17755: Test singularity support on a cloud cluster by running some real workflows added

#3 Updated by Ward Vandewege 4 months ago

  • Related to deleted (Story #17755: Test singularity support on a cloud cluster by running some real workflows)

#4 Updated by Ward Vandewege 4 months ago

  • Blocked by Story #17755: Test singularity support on a cloud cluster by running some real workflows added

#5 Updated by Ward Vandewege 4 months ago

  • Blocked by deleted (Story #17755: Test singularity support on a cloud cluster by running some real workflows)

#6 Updated by Ward Vandewege 4 months ago

  • Blocks Story #17755: Test singularity support on a cloud cluster by running some real workflows added

#7 Updated by Tom Clegg 4 months ago

Proposed solution: When creating a container request, api/controller has a (configurable?) list of mount points like /tmp and /var/tmp that automatically get added as mounts (as if they had been specified with {"kind":"tmp","capacity":10000000,"device_type":"disk"}) if the container request spec does not mount anything at or above that point.

The current implementation of crunch-run ignores the "capacity" argument (it's used elsewhere for choosing a node type, but crunch-run doesn't try to limit usage at runtime) so the arbitrary size 10000000 doesn't really matter.

Arguably, the requester really should be specifying these mounts explicitly instead of expecting the entire filesystem to be writable -- but configurable automatic/implicit mounts should make the migration much easier, with the option to turn it off once all clients/workflows are updated with explicit mounts.
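
A minimal sketch of the implicit-mount rule proposed above, written in Python for illustration only. The function names, the default path list, and the shape of the mounts dictionary are assumptions and not the api/controller implementation; the tmp mount spec is the one quoted in the proposal.

```python
import posixpath

# Mount spec quoted in the proposal; the capacity value is arbitrary.
TMP_MOUNT = {"kind": "tmp", "capacity": 10000000, "device_type": "disk"}


def is_at_or_above(mount_point, target):
    """Return True if mount_point is target itself or one of its parent directories."""
    mount_point = posixpath.normpath(mount_point)
    target = posixpath.normpath(target)
    if mount_point == "/":
        return True
    return target == mount_point or target.startswith(mount_point + "/")


def add_implicit_mounts(mounts, implicit_paths=("/tmp", "/var/tmp")):
    """Return a copy of mounts with a tmp mount added for each uncovered implicit path.

    implicit_paths stands in for the (hypothetical) configurable list.
    """
    result = dict(mounts)
    for path in implicit_paths:
        # Coverage is checked against the mounts the requester specified.
        if not any(is_at_or_above(existing, path) for existing in mounts):
            result[path] = dict(TMP_MOUNT)
    return result
```

With this rule, a request that already mounts /var gets an implicit /tmp but no implicit /var/tmp, and a request that mounts /tmp explicitly keeps its own spec for that path.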

#8 Updated by Tom Clegg 4 months ago

#9 Updated by Peter Amstutz 4 months ago

  • Target version deleted (To Be Groomed)

#10 Updated by Ward Vandewege 4 months ago

  • Subject changed from [singularity] error: read-only file system to [singularity] a-c-r should add /tmp to its job description, or not use it

#11 Updated by Ward Vandewege 4 months ago

  • Target version set to 2021-07-21 sprint

#12 Updated by Ward Vandewege 4 months ago

  • Subject changed from [singularity] a-c-r should add /tmp to its job description, or not use it to [a-c-r] should add /tmp to its job description, or not use it (affects singularity which has a read-only container filesystem)

#13 Updated by Peter Amstutz 4 months ago

  • Assigned To set to Peter Amstutz

#14 Updated by Peter Amstutz 4 months ago

17816-crunch-dispatch-singularity

Adds --runtime-engine to the crunch-run invocation of crunch-dispatch-local and crunch-dispatch-slurm.

arvados|753d479b0b5960674bf8e5a27ee98f68b3cd06ce

https://ci.arvados.org/view/Developer/job/developer-run-tests/2581/

#15 Updated by Lucas Di Pentima 4 months ago

Some comments & questions:

  • If we're starting to use the config loader on crunch-dispatch-local, do you think it would be convenient to create the arvados client (line 89) from the config data instead of an env var? That way we could avoid potentially difficult-to-debug problems where a dispatcher runs against one cluster but uses the runtime engine of another.
  • Related to the above comment: I think a migration note should be added if c-d-l depends on the config file being present.
  • I'm assuming that we don't have any integration tests for singularity + c-d-l, does it make sense to add some?

#16 Updated by Peter Amstutz 4 months ago

  • Subject changed from [a-c-r] should add /tmp to its job description, or not use it (affects singularity which has a read-only container filesystem) to singularity not setting working directory

#17 Updated by Peter Amstutz 4 months ago

As it turns out, a-c-r isn't trying to create temp directories in /tmp; they get created in the current directory.

This revealed the actual problem, which was that crunch-run was not setting the working directory when using singularity, so the program was being started in /root (not writable) instead of /var/spool/cwl (writable).
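
A small Python sketch of why the failure depends on the working directory. It assumes, based on the relative path in the traceback ('tmpph0scv23'), that the dir argument reaching tempfile.mkdtemp is an empty string; make_stagedir and demo are hypothetical names for illustration, not cwltool functions.

```python
import os
import tempfile


def make_stagedir(tmp_prefix="tmp", tmp_dir=""):
    """Create a temp directory the way the traceback suggests (dir assumed empty)."""
    # With an empty dir, the new directory is created relative to the
    # process's current working directory, not under /tmp.
    return tempfile.mkdtemp(prefix=tmp_prefix, dir=tmp_dir)


def demo():
    """Return True if the directory lands inside the current working directory."""
    with tempfile.TemporaryDirectory() as workdir:
        previous = os.getcwd()
        os.chdir(workdir)
        try:
            created = make_stagedir()
            # Resolve while still inside workdir, since the path may be relative.
            parent = os.path.dirname(os.path.realpath(created))
            return parent == os.path.realpath(workdir)
        finally:
            os.chdir(previous)
```

If the process starts in a directory on a read-only filesystem, the same call fails with the "Read-only file system" error shown in the log, regardless of whether /tmp is writable.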

#18 Updated by Peter Amstutz 3 months ago

Lucas Di Pentima wrote:

Some comments & questions:

  • If we're starting to use the config loader on crunch-dispatch-local, do you think it would be convenient to create the arvados client (line 89) from the config data instead of an env var? That way we could avoid potentially difficult-to-debug problems where a dispatcher runs against one cluster but uses the runtime engine of another.
  • Related to the above comment: I think a migration note should be added if c-d-l depends on the config file being present.

Uses the cluster config now. Added upgrade note.

  • I'm assuming that we don't have any integration tests for singularity + c-d-l, does it make sense to add some?

I don't think so. Our unit test framework couldn't have helped discover that the feature didn't exist; that required the kind of system testing that I was already doing.

17816-crunch-dispatch-singularity @ f2ee5bac37391ce9fe084306da332becd7620ca7

In addition, I fixed the actual original bug, which was the "read only file system" error. I also discovered an apparent discrepancy between the comment on MarshalManifest (that it is supposed to flush before getting the manifest text) and the actual behavior (it doesn't) -- fixed by adding a call to Flush().

17816-singularity-cwd @ eec5086af5c2d1c1f17bbc525cc68d394c9680f4

#20 Updated by Peter Amstutz 3 months ago

Ran the entire CWL test suite. Getting a semi-random failure:

container creation failed: mount /tmp/crunch-run.x2z00-dz642-38q11t2a8r1d3tv.087008439/keep438609578/by_id/04f89c0db086d2496544715d9ddc4875+72/renamed-filelist.txt->/var/spool/cwl/renamed-filelist.txt error: while mounting /tmp/crunch-run.x2z00-dz642-38q11t2a8r1d3tv.087008439/keep438609578/by_id/04f89c0db086d2496544715d9ddc4875+72/renamed-filelist.txt: destination /var/spool/cwl/renamed-filelist.txt doesn't exist in container

/var/spool/cwl is the output directory and /var/spool/cwl/renamed-filelist.txt is a file that's supposed to be staged in the output directory.

It doesn't fail every time, so I suspect that we need to sort the bind mounts to ensure that /var/spool/cwl is included on the command line before /var/spool/cwl/renamed-filelist.txt.
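
A minimal sketch of the suspected fix, in Python for illustration only (this is not the crunch-run code). It orders bind targets so that a parent directory always precedes anything mounted inside it. ordered_bind_args and the host paths in the usage example are hypothetical; the "--bind src:dest" layout follows singularity's bind option, but how crunch-run actually builds its command line is an assumption here.

```python
def ordered_bind_args(binds):
    """Build --bind arguments with parent targets ahead of their children.

    binds maps a target path inside the container to a source path on the host.
    """
    def components(target):
        # Compare paths component by component, so a directory sorts
        # before every path nested beneath it.
        return target.rstrip("/").split("/")

    args = []
    for target in sorted(binds, key=components):
        args += ["--bind", "{}:{}".format(binds[target], target)]
    return args


example = ordered_bind_args({
    "/var/spool/cwl/renamed-filelist.txt": "/host/keep/renamed-filelist.txt",
    "/var/spool/cwl": "/host/tmp/output",
})
```

Sorting by path components rather than relying on insertion order makes the result deterministic, which matters if the mounts come from a container whose iteration order is not stable.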

#22 Updated by Tom Clegg 3 months ago

17816-singularity-cwd

#23 Updated by Peter Amstutz 3 months ago

17816-singularity-cwd @ 9a1c67deabd249e068284bb86f148d4aa9998711

https://ci.arvados.org/view/Developer/job/developer-run-tests/2587/

#24 Updated by Tom Clegg 3 months ago

LGTM, thanks

#25 Updated by Peter Amstutz 3 months ago

  • Status changed from New to Resolved
