Bug #17816

closed

singularity not setting working directory

Added by Ward Vandewege almost 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Story points: -
Release relationship: Auto

Description

As part of testing in #17755, it appears that there is a problem with /tmp (?) not being writable in our singularity containers. a-c-r relies on this, and many other tools likely will too. It seems a bit unreasonable for /tmp to be unwritable, if that is really the problem. Maybe we can change something about the way we are invoking singularity?

See e.g. ce8i5-xvhdp-1ubjhuo87i24ora:

2021-06-16T20:15:21.003947318Z INFO /usr/bin/arvados-cwl-runner 2.3.0.dev20210610215458, arvados-python-client 2.3.0.dev20210610215458, cwltool 3.0.20210319143721
2021-06-16T20:15:21.022483370Z INFO Resolved '/var/lib/cwl/workflow.json#main' to 'file:///var/lib/cwl/workflow.json#main'
2021-06-16T20:15:23.231706231Z INFO Using cluster ce8i5 (https://workbench2.ce8i5.arvadosapi.com/)
2021-06-16T20:15:30.386974208Z INFO Using collection cache size 256 MiB
2021-06-16T20:15:30.448702283Z INFO Running inside container ce8i5-dz642-9asvb2g41z514n8
2021-06-16T20:15:30.665440997Z INFO [workflow workflow.json#main] start
2021-06-16T20:15:30.666220299Z INFO [workflow workflow.json#main] starting step substep
2021-06-16T20:15:30.666896701Z INFO [step substep] start
2021-06-16T20:15:31.473008286Z WARNING X-Keep-Storage-Classes header not supported by the cluster
2021-06-16T20:15:31.700830931Z INFO Using collection ce8i5-4zz18-df50zijeqkpbdaf
2021-06-16T20:15:36.635135414Z ERROR Unexpected exception
2021-06-16T20:15:36.635135414Z Traceback (most recent call last):
2021-06-16T20:15:36.635135414Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow.py", line 436, in job
2021-06-16T20:15:36.635135414Z     runtimeContext,
2021-06-16T20:15:36.635135414Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/command_line_tool.py", line 964, in job
2021-06-16T20:15:36.635135414Z     j.stagedir = runtimeContext.create_tmpdir()
2021-06-16T20:15:36.635135414Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/context.py", line 181, in create_tmpdir
2021-06-16T20:15:36.635135414Z     return tempfile.mkdtemp(prefix=tmp_prefix, dir=tmp_dir)
2021-06-16T20:15:36.635135414Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/tempfile.py", line 505, in mkdtemp
2021-06-16T20:15:36.635135414Z     _os.mkdir(file, 0o700)
2021-06-16T20:15:36.635135414Z OSError: [Errno 30] Read-only file system: 'tmpph0scv23'
2021-06-16T20:15:36.831953372Z ERROR Cannot make scatter job: [Errno 30] Read-only file system: 'tmpph0scv23'
2021-06-16T20:15:37.120042388Z INFO [step substep] start
2021-06-16T20:15:37.240278629Z ERROR Unexpected exception
2021-06-16T20:15:37.240278629Z Traceback (most recent call last):
2021-06-16T20:15:37.240278629Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow.py", line 436, in job
2021-06-16T20:15:37.240278629Z     runtimeContext,
2021-06-16T20:15:37.240278629Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/command_line_tool.py", line 964, in job
2021-06-16T20:15:37.240278629Z     j.stagedir = runtimeContext.create_tmpdir()
2021-06-16T20:15:37.240278629Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/context.py", line 181, in create_tmpdir
2021-06-16T20:15:37.240278629Z     return tempfile.mkdtemp(prefix=tmp_prefix, dir=tmp_dir)
2021-06-16T20:15:37.240278629Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/tempfile.py", line 505, in mkdtemp
2021-06-16T20:15:37.240278629Z     _os.mkdir(file, 0o700)
2021-06-16T20:15:37.240278629Z OSError: [Errno 30] Read-only file system: 'tmpi4ayzhq8'
2021-06-16T20:15:37.383785035Z ERROR Cannot make scatter job: [Errno 30] Read-only file system: 'tmpi4ayzhq8'
2021-06-16T20:15:37.546091895Z INFO [step substep] start
2021-06-16T20:15:37.660351019Z ERROR Unexpected exception
2021-06-16T20:15:37.660351019Z Traceback (most recent call last):
2021-06-16T20:15:37.660351019Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow.py", line 436, in job
2021-06-16T20:15:37.660351019Z     runtimeContext,
2021-06-16T20:15:37.660351019Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/command_line_tool.py", line 964, in job
2021-06-16T20:15:37.660351019Z     j.stagedir = runtimeContext.create_tmpdir()
2021-06-16T20:15:37.660351019Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/context.py", line 181, in create_tmpdir
2021-06-16T20:15:37.660351019Z     return tempfile.mkdtemp(prefix=tmp_prefix, dir=tmp_dir)
2021-06-16T20:15:37.660351019Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/tempfile.py", line 505, in mkdtemp
2021-06-16T20:15:37.660351019Z     _os.mkdir(file, 0o700)
2021-06-16T20:15:37.660351019Z OSError: [Errno 30] Read-only file system: 'tmpybc3dv2f'
2021-06-16T20:15:37.867057605Z ERROR Cannot make scatter job: [Errno 30] Read-only file system: 'tmpybc3dv2f'
2021-06-16T20:15:37.985364440Z INFO [step substep] start
2021-06-16T20:15:38.084313721Z ERROR Unexpected exception
2021-06-16T20:15:38.084313721Z Traceback (most recent call last):
2021-06-16T20:15:38.084313721Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/workflow.py", line 436, in job
2021-06-16T20:15:38.084313721Z     runtimeContext,
2021-06-16T20:15:38.084313721Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/command_line_tool.py", line 964, in job
2021-06-16T20:15:38.084313721Z     j.stagedir = runtimeContext.create_tmpdir()
2021-06-16T20:15:38.084313721Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/site-packages/cwltool/context.py", line 181, in create_tmpdir
2021-06-16T20:15:38.084313721Z     return tempfile.mkdtemp(prefix=tmp_prefix, dir=tmp_dir)
2021-06-16T20:15:38.084313721Z   File "/usr/share/python3/dist/python3-arvados-cwl-runner/lib/python3.7/tempfile.py", line 505, in mkdtemp
2021-06-16T20:15:38.084313721Z     _os.mkdir(file, 0o700)
2021-06-16T20:15:38.084313721Z OSError: [Errno 30] Read-only file system: 'tmp8_atoptz'
2021-06-16T20:15:38.254878304Z ERROR Cannot make scatter job: [Errno 30] Read-only file system: 'tmp8_atoptz'
2021-06-16T20:15:38.424077683Z WARNING [step substep] completed permanentFail
2021-06-16T20:15:38.585621241Z INFO [workflow workflow.json#main] completed permanentFail
2021-06-16T20:15:38.585821742Z ERROR Overall process status is permanentFail

Subtasks 3 (0 open, 3 closed)

Task #17871: Review 17816-singularity-cwd (Resolved, Tom Clegg, 07/15/2021)
Task #17907: fix (Resolved, Peter Amstutz, 07/15/2021)
Task #17915: Review 17816-crunch-dispatch-singularity (Resolved, Tom Clegg, 07/14/2021)

Related issues

Related to Arvados Epics - Idea #16305: Singularity support (Resolved, 01/01/2021, 09/30/2021)
Blocks Arvados - Idea #17755: Test singularity support on a cloud cluster by running some real workflows (Resolved, Ward Vandewege, 09/03/2021)
Actions #1

Updated by Ward Vandewege almost 3 years ago

  • Description updated (diff)
  • Subject changed from [singularity] /tmp (?) is not writeable to [singularity] error: read-only file system
Actions #2

Updated by Ward Vandewege almost 3 years ago

  • Related to Idea #17755: Test singularity support on a cloud cluster by running some real workflows added
Actions #3

Updated by Ward Vandewege almost 3 years ago

  • Related to deleted (Idea #17755: Test singularity support on a cloud cluster by running some real workflows)
Actions #4

Updated by Ward Vandewege almost 3 years ago

  • Blocked by Idea #17755: Test singularity support on a cloud cluster by running some real workflows added
Actions #5

Updated by Ward Vandewege almost 3 years ago

  • Blocked by deleted (Idea #17755: Test singularity support on a cloud cluster by running some real workflows)
Actions #6

Updated by Ward Vandewege almost 3 years ago

  • Blocks Idea #17755: Test singularity support on a cloud cluster by running some real workflows added
Actions #7

Updated by Tom Clegg almost 3 years ago

Proposed solution: When creating a container request, api/controller has a (configurable?) list of mount points like /tmp and /var/tmp that automatically get added as mounts (as if they had been specified with {"kind":"tmp","capacity":10000000,"device_type":"disk"}) if the container request spec does not mount anything at or above that point.

The current implementation of crunch-run ignores the "capacity" argument (it's used elsewhere for choosing a node type, but crunch-run doesn't try to limit usage at runtime) so the arbitrary size 10000000 doesn't really matter.

Arguably, the requester really should be specifying these mounts explicitly instead of expecting the entire filesystem to be writable -- but configurable automatic/implicit mounts should make the migration much easier, with the option to turn it off once all clients/workflows are updated with explicit mounts.
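
To make the proposal concrete, here is a minimal sketch of the "add implicit tmp mounts unless something is already mounted at or above that point" rule. It is not the controller's actual code: the function names, the `IMPLICIT_TMP_MOUNTS` list, and the assumption that mounts are a dict keyed by mount point are all hypothetical. Only the mount spec itself is taken from the comment above.

```python
import posixpath

# Hypothetical default list; the proposal suggests this could be configurable.
IMPLICIT_TMP_MOUNTS = ["/tmp", "/var/tmp"]

# Mount spec quoted in the proposal.
TMP_MOUNT = {"kind": "tmp", "capacity": 10000000, "device_type": "disk"}


def is_at_or_above(mount_point, target):
    """Return True if mount_point equals target or is an ancestor of it."""
    mount_point = posixpath.normpath(mount_point)
    target = posixpath.normpath(target)
    if mount_point == "/":
        return True
    return target == mount_point or target.startswith(mount_point + "/")


def add_implicit_mounts(mounts, implicit=IMPLICIT_TMP_MOUNTS):
    """Return a copy of mounts with implicit tmp mounts added where needed.

    A target is skipped when the request already mounts something at or
    above it, so explicit mounts always win.
    """
    result = dict(mounts)
    for target in implicit:
        if not any(is_at_or_above(existing, target) for existing in mounts):
            result[target] = dict(TMP_MOUNT)
    return result
```

With this rule, a request that only mounts /var/spool/cwl gains both /tmp and /var/tmp, while a request that already mounts /var gains only /tmp.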

Actions #8

Updated by Tom Clegg almost 3 years ago

Actions #9

Updated by Peter Amstutz almost 3 years ago

  • Target version deleted (To Be Groomed)
Actions #10

Updated by Ward Vandewege almost 3 years ago

  • Subject changed from [singularity] error: read-only file system to [singularity] a-c-r should add /tmp to its job description, or not use it
Actions #11

Updated by Ward Vandewege almost 3 years ago

  • Target version set to 2021-07-21 sprint
Actions #12

Updated by Ward Vandewege almost 3 years ago

  • Subject changed from [singularity] a-c-r should add /tmp to its job description, or not use it to [a-c-r] should add /tmp to its job description, or not use it (affects singularity which has a read-only container filesystem)
Actions #13

Updated by Peter Amstutz almost 3 years ago

  • Assigned To set to Peter Amstutz
Actions #14

Updated by Peter Amstutz almost 3 years ago

17816-crunch-dispatch-singularity

Adds --runtime-engine to the crunch-run invocation of crunch-dispatch-local and crunch-dispatch-slurm.

arvados|753d479b0b5960674bf8e5a27ee98f68b3cd06ce

developer-run-tests: #2581

Actions #15

Updated by Lucas Di Pentima almost 3 years ago

Some comments & questions:

  • If we're starting to use the config loader on crunch-dispatch-local, do you think it would be convenient to create the arvados client (line 89) from the config data instead of an env var? That way we could avoid potentially difficult-to-debug problems where a dispatcher runs against one cluster but uses the runtime engine of another.
  • Related to the above comment: I think a migration note should be added if c-d-l depends on the config file being present.
  • I'm assuming that we don't have any integration tests for singularity + c-d-l, does it make sense to add some?
Actions #16

Updated by Peter Amstutz almost 3 years ago

  • Subject changed from [a-c-r] should add /tmp to its job description, or not use it (affects singularity which has a read-only container filesystem) to singularity not setting working directory
Actions #17

Updated by Peter Amstutz almost 3 years ago

As it turns out, a-c-r isn't trying to create temp directories in /tmp; they get created in the current directory.

This revealed the actual problem: crunch-run was not setting the working directory when using singularity, so the program was being started in /root (not writable) instead of /var/spool/cwl (writable).
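
The relative path in the error message ('tmpph0scv23' with no leading directory) is consistent with this. As a small illustration, assuming the `dir` argument that reaches `tempfile.mkdtemp` is an empty string (the helper name below is hypothetical), the new directory is created under whatever the current working directory happens to be:

```python
import os
import tempfile


def parent_of_relative_mkdtemp(workdir):
    """Call mkdtemp(dir="") with workdir as the cwd; return the new dir's parent."""
    previous = os.getcwd()
    os.chdir(workdir)
    try:
        # dir="" (rather than None) bypasses the default temp location,
        # so the generated name is resolved against the current directory.
        created = tempfile.mkdtemp(prefix="tmp", dir="")
        # Resolve while still inside workdir, since older Python versions
        # return the relative path unchanged.
        return os.path.dirname(os.path.realpath(created))
    finally:
        os.chdir(previous)


with tempfile.TemporaryDirectory() as workdir:
    # The temp directory lands inside the working directory, not in /tmp.
    print(parent_of_relative_mkdtemp(workdir) == os.path.realpath(workdir))
```

If the working directory is read-only, as /root was here, the same call fails with the "Read-only file system" error shown in the log.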

Actions #18

Updated by Peter Amstutz almost 3 years ago

Lucas Di Pentima wrote:

Some comments & questions:

  • If we're starting to use the config loader on crunch-dispatch-local, do you think it would be convenient to create the arvados client (line 89) from the config data instead of an env var? That way we could avoid potentially difficult-to-debug problems where a dispatcher runs against one cluster but uses the runtime engine of another.
  • Related to the above comment: I think a migration note should be added if c-d-l depends on the config file being present.

Uses the cluster config now. Added upgrade note.

  • I'm assuming that we don't have any integration tests for singularity + c-d-l, does it make sense to add some?

I don't think so. Our unit test framework couldn't have helped discover that the feature didn't exist; that required the kind of system testing that I was already doing.

17816-crunch-dispatch-singularity @ f2ee5bac37391ce9fe084306da332becd7620ca7

In addition, I fixed the actual original bug, which was the "read only file system" error. I also discovered an apparent discrepancy between the comment on MarshalManifest (that it is supposed to flush before getting the manifest text) and the actual behavior (it doesn't) -- fixed by adding a call to Flush().

17816-singularity-cwd @ eec5086af5c2d1c1f17bbc525cc68d394c9680f4

Actions #20

Updated by Peter Amstutz almost 3 years ago

Ran the entire CWL test suite. Getting a semi-random failure:

container creation failed: mount /tmp/crunch-run.x2z00-dz642-38q11t2a8r1d3tv.087008439/keep438609578/by_id/04f89c0db086d2496544715d9ddc4875+72/renamed-filelist.txt->/var/spool/cwl/renamed-filelist.txt error: while mounting /tmp/crunch-run.x2z00-dz642-38q11t2a8r1d3tv.087008439/keep438609578/by_id/04f89c0db086d2496544715d9ddc4875+72/renamed-filelist.txt: destination /var/spool/cwl/renamed-filelist.txt doesn't exist in container

/var/spool/cwl is the output directory and /var/spool/cwl/renamed-filelist.txt is a file that's supposed to be staged in the output directory.

It doesn't fail every time, so I suspect that we need to sort the bind mounts to ensure that /var/spool/cwl is included on the command line before /var/spool/cwl/renamed-filelist.txt.
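
A rough sketch of that ordering idea, not the crunch-run implementation: sorting the container-side targets lexicographically always puts a parent directory before anything beneath it, because a path is a string prefix of its children. The function name and the host paths in the usage note are made up for illustration, and the `--bind source:target` form is assumed to match how singularity is invoked.

```python
def ordered_bind_args(binds):
    """Build --bind arguments with parent targets before their children.

    binds maps container target path -> host source path (assumed shape).
    """
    args = []
    # A plain string sort is enough: "/var/spool/cwl" is a prefix of
    # "/var/spool/cwl/renamed-filelist.txt", so it always sorts first.
    for target in sorted(binds):
        args += ["--bind", f"{binds[target]}:{target}"]
    return args
```

For example, passing `{"/var/spool/cwl/renamed-filelist.txt": "/host/keep/renamed-filelist.txt", "/var/spool/cwl": "/host/tmp/out"}` yields the /var/spool/cwl bind first regardless of the order in which the entries were inserted.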

Actions #22

Updated by Tom Clegg almost 3 years ago

17816-singularity-cwd
Actions #23

Updated by Peter Amstutz almost 3 years ago

17816-singularity-cwd @ 9a1c67deabd249e068284bb86f148d4aa9998711

developer-run-tests: #2587

Actions #24

Updated by Tom Clegg almost 3 years ago

LGTM, thanks

Actions #25

Updated by Peter Amstutz almost 3 years ago

  • Status changed from New to Resolved
Actions #26

Updated by Peter Amstutz over 2 years ago

  • Release set to 42