Bug #12246

[Crunch] Better crunch-run error when command not found

Added by Peter Amstutz over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Start date: 09/27/2017
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -

Description

If a container specifies a command that cannot be found, or a script whose #! interpreter cannot be found, the resulting error is very cryptic. crunch-run should provide a better error message.


Subtasks

Task #12265: Review 12246-command-not-found (Resolved, Peter Amstutz)


Related issues

Related to Arvados - Bug #12298: [Crunch2] Invalid container output_path causes infinite loop of futile dispatch attempts (Resolved, 09/20/2017)

Associated revisions

Revision c2a861d0 (diff)
Added by Tom Clegg over 3 years ago

Fix dashboard crash on uncommitted container request.

refs #12246

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision 91143ef5
Added by Tom Clegg over 3 years ago

Merge branch '12246-command-not-found'

closes #12246

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision 1ebfa03c (diff)
Added by Peter Amstutz over 3 years ago

12246: Make "possible causes" message easier to find and read. refs #12246

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

History

#1 Updated by Peter Amstutz over 3 years ago

  • Description updated (diff)

#2 Updated by Tom Morris over 3 years ago

  • Tracker changed from Story to Bug
  • Target version set to 2017-09-27 Sprint

At a minimum, the error message should include a quoted and escaped version of the program that it attempted to run and didn't find.

#3 Updated by Peter Amstutz over 3 years ago

  • Assigned To set to Peter Amstutz

#4 Updated by Tom Clegg over 3 years ago

  • Category set to Crunch
  • Status changed from New to In Progress
  • Assigned To changed from Peter Amstutz to Tom Clegg

#5 Updated by Tom Clegg over 3 years ago

When the command doesn't exist, the error message isn't bad:

$ arv container_request create --container-request '{"command":["foobar"],"container_image":"arvados/jobs","output_path":"/out","state":"Committed","runtime_constraints":{"vcpus":1,"ram":1000000},"priority":1,"mounts":{"/out":{"kind":"tmp","capacity":1000000}}}'

2017-09-20T20:39:14.037657193Z exec: "foobar": executable file not found in $PATH
2017-09-20T20:39:15.067555258Z could not start container: Error response from daemon: Cannot start container 58099cd76c834f3dc2a4fb76c8028f049ae6d4fdf0ec373e1f2cfea030670c2d: [8] System error: exec: "foobar": executable file not found in $PATH
2017-09-20T20:39:15.067632751Z Cancelled
2017-09-20T20:39:15.581835Z Container 9tee4-dz642-oukm4tdpxpl67dx was cancelled

#6 Updated by Tom Clegg over 3 years ago

Using a docker image with "#!/bin/durrgh" in /bin/fail, the message is more obscure:

$ arv container_request create --container-request '{"command":["/bin/fail"],"container_image":"fail","output_path":"/out","state":"Committed","runtime_constraints":{"vcpus":1,"ram":1000000},"priority":1,"mounts":{"/out":{"kind":"tmp","capacity":1000000}}}'
2017-09-20T20:50:25.549086754Z Starting Docker container id '41f26cbc43bcc1280f4323efb1830a394ba8660c9d1c2b564ba42bf7f7694845'
2017-09-20T20:50:29.091406142Z could not start container: Error response from daemon: Cannot start container 41f26cbc43bcc1280f4323efb1830a394ba8660c9d1c2b564ba42bf7f7694845: [8] System error: no such file or directory
2017-09-20T20:50:29.091468975Z Cancelled
2017-09-20T20:50:29.636949Z Container 9tee4-dz642-bi9yzjmqqlqrnj9 was cancelled

Also tried "#!/bin/durrgh\r\ndurrgh\r\n", with the same result.

#10 Updated by Peter Amstutz over 3 years ago

Tom Clegg wrote:

So if container startup fails, we should make sure to report the command being invoked...

Maybe the most helpful thing to add to the "could not start container" error is a hint about a known cause of that error:

[...]

..., and if it is located on a keep mount, see about reading the first line of the file to report that path and check for the Windows newline issue.

I feel like predicting which file(s) will be executed by a given command array will be hard to get right (e.g., what will $PATH be inside the container?), and even getting it right sometimes might not be worth the trouble...

Here's my proposed behavior if startup fails (for any reason, since trying to sniff out exact error seems like an exercise in frustration):

  1. Report the first item in the command array
  2. If the first item in the command array is located on a keep mount, check whether its first two bytes are #!. If so, read the first line, report it, and check for a Windows newline.

#11 Updated by Tom Clegg over 3 years ago

The panic was a runc bug, fixed here: https://github.com/opencontainers/runc/pull/1117

Inside runc the panic was then converted to an error that includes a stack trace. So the effect of the runc fix is just to reduce the message from "error + stack trace" to just "error".

#12 Updated by Tom Clegg over 3 years ago

I'm still not going to take apart $PATH and transform paths and symlinks to figure out what exec() would do in the container. That kind of fix will just come with its own bugs, etc.

Agree with note-2 and note-10 that reporting the first item in the command array seems helpful. When the command itself doesn't exist, the missing command is already mentioned (twice!) in the error message so adding it a third time doesn't seem compelling. But where bash gives a "bad interpreter" error mentioning the bad interpreter, docker is somewhat coy.

$ /tmp/bogus 
-bash: /tmp/bogus: /bin/nooooo: bad interpreter: No such file or directory
$ docker run -it --rm 1b044b40475d /bin/bogus
standard_init_linux.go:178: exec user process caused "no such file or directory" 

So in this case we add a hint.

fmt.Sprintf(" (perhaps command %q is missing, or has a missing #! interpreter, or was saved in DOS mode with cr-lf chars?)", runner.Container.Command[0])

12246-command-not-found @ deb14a7264ed4a07d154504991447c3be8413db7

#13 Updated by Tom Clegg over 3 years ago

Just to clarify about the runc panic stack trace: it seems the stack trace is not a crash, it's just a verbose error message from docker. It does include the "no such file or directory" string, so if you use a docker version that predates the runc fix, you'll still benefit from this new "suggest checking #!" feature, although the suggestion will be a bit harder to see above the giant wall of stack trace.

#14 Updated by Ward Vandewege over 3 years ago

  • Subject changed from Better crunch-run error when command not found to [Crunch] Better crunch-run error when command not found

#15 Updated by Peter Amstutz over 3 years ago

It runs together on a very long line, which makes it hard to read. Could the "advice" come after the error message on a separate line?

  2017-09-27T16:45:00.811779771Z crunch-run Starting Docker container id '316d454ad4bf0864a2daaa0357201a2e27382158469c33010fb5c2900708500c'
  2017-09-27T16:45:00.990318516Z stderr container_linux.go:247: starting container process caused "exec: \"/does/not/exists\": stat /does/not/exists: no such file or directory" 
  2017-09-27T16:45:01.136591766Z crunch-run could not start container (perhaps command "/does/not/exists" is missing, or has a missing #! interpreter, or was saved in DOS mode with cr-lf chars?): Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "exec: \"/does/not/exists\": stat /does/not/exists: no such file or directory" 
  2017-09-27T16:45:01.136607933Z crunch-run Cancelled

#16 Updated by Anonymous over 3 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:91143ef549e065ebdfb0138a031fc1fbd65cb527.

#17 Updated by Peter Amstutz over 3 years ago

12246-better-advice:

  2017-09-27T18:17:07.782746483Z stderr container_linux.go:247: starting container process caused "exec: \"/does/not/exists\": stat /does/not/exists: no such file or directory" 
  2017-09-27T18:17:07.930777304Z crunch-run could not start container: Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "exec: \"/does/not/exists\": stat /does/not/exists: no such file or directory" 
  2017-09-27T18:17:07.930777304Z crunch-run Possible causes: command "/does/not/exists" is missing, the interpreter given in #! is missing, or script has Windows line endings.
  2017-09-27T18:17:07.930798102Z crunch-run Cancelled
