Project

General

Profile

Actions

Bug #6094

closed

[Crunch] Install script exited 141

Added by Peter Amstutz almost 9 years ago. Updated almost 9 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Story points:
1.0

Description

Pipeline qr1hi-d1hrv-yer77lzhezepsui

2015-05-20_12:15:57 salloc: Granted job allocation 2617 2015-05-20_12:15:58 qr1hi-8i9sb-ltiwtsu42bkytce 17131 running from /usr/local/arvados/src/sdk/cli/bin/crunch-job with arvados-cli Gem version(s) 0.1.20150205181653, 0.1.20150128223752, 0.1.20150121183928, 0.1.20141209151444, 0.1.20141014201516, 0.1.20140919104705, 0.1.20140905165259, 0.1.20140827170424, 0.1.20140825141611, 0.1.20140812162850, 0.1.20140708213257, 0.1.20140707162447, 0.1.20140630151639, 0.1.20140513131358, 0.1.20140513101345, 0.1.20140414145041 
2015-05-20_12:15:58 qr1hi-8i9sb-ltiwtsu42bkytce 17131 check slurm allocation 
2015-05-20_12:15:58 qr1hi-8i9sb-ltiwtsu42bkytce 17131 node compute32 - 1 slots 
2015-05-20_12:15:58 qr1hi-8i9sb-ltiwtsu42bkytce 17131 start 
2015-05-20_12:15:58 qr1hi-8i9sb-ltiwtsu42bkytce 17131 Clean work dirs 
2015-05-20_12:15:58 starting: ['srun','--nodelist=compute32','-D','/tmp','bash','-ec','mount -t fuse,fuse.keep | awk \'($3 ~ /\\ykeep\\y/){print $3}\' | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid'] 
2015-05-20_12:16:00 qr1hi-8i9sb-ltiwtsu42bkytce 17131 Cleanup command exited 0 
2015-05-20_12:16:00 starting: ['srun','--nodelist=compute32','/bin/sh','-ec',' if ! /usr/bin/docker.io images -q --no-trunc --all | grep -qxF b85dffb1be2ca7bc757be6ff8ae4873a45214918282ef42cc2cbc2cead63356b; then arv-get 877b0bb063029e309ec3d0624e75eeda\\+503\\/b85dffb1be2ca7bc757be6ff8ae4873a45214918282ef42cc2cbc2cead63356b\\.tar | /usr/bin/docker.io load fi '] 
2015-05-20_12:16:00 starting: ['srun','--nodelist=compute32','/bin/sh','-ec','/usr/bin/docker.io run --help | grep -qe --memory-swap='] 
2015-05-20_12:16:00 qr1hi-8i9sb-ltiwtsu42bkytce 17131 Packing Arvados SDK version 29009a1c1f8a9653042c5853832881aca4141cf2 for installation 
2015-05-20_12:16:00 qr1hi-8i9sb-ltiwtsu42bkytce 17131 Looking for version c59093180fd92f0d7c6607a49458446212ebd058 from repository arvados 
2015-05-20_12:16:00 qr1hi-8i9sb-ltiwtsu42bkytce 17131 Using local repository '/var/lib/arvados/internal.git' 
2015-05-20_12:16:00 qr1hi-8i9sb-ltiwtsu42bkytce 17131 Version c59093180fd92f0d7c6607a49458446212ebd058 is commit c59093180fd92f0d7c6607a49458446212ebd058 
2015-05-20_12:16:01 qr1hi-8i9sb-ltiwtsu42bkytce 17131 Run install script on all workers 
2015-05-20_12:16:01 starting: ['srun','--nodelist=compute32','-D','/tmp','--job-name=qr1hi-8i9sb-ltiwtsu42bkytce','sh','-c','mkdir -p /tmp/crunch-job/opt && cd /tmp/crunch-job && perl -'] 
2015-05-20_12:16:03 srun: error: compute32: task 0: Exited with exit code 141 
2015-05-20_12:16:03 qr1hi-8i9sb-ltiwtsu42bkytce 17131 Install script exited 141 
2015-05-20_12:16:03 salloc: Relinquishing job allocation 2617 
2015-05-20_12:16:03 Collection saved as 'qr1hi-8i9sb-ltiwtsu42bkytce.log.txt' 
2015-05-20_12:16:03 close failed in file object destructor: 
2015-05-20_12:16:03 sys.excepthook is missing 
2015-05-20_12:16:03 lost sys.stderr

Subtasks 2 (0 open2 closed)

Task #6111: Review 6094-install-script-sigpipeResolvedTom Clegg05/22/2015Actions
Task #6144: Review 6094-install-script-sigpipe @ e010ab7ResolvedTom Clegg05/22/2015Actions
Actions #1

Updated by Peter Amstutz almost 9 years ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz almost 9 years ago

  • Description updated (diff)
Actions #3

Updated by Bryan Cosca almost 9 years ago

Another instance with this error: qr1hi-d1hrv-96bdqhdlhgopth1

Actions #4

Updated by Bryan Cosca almost 9 years ago

Actions #5

Updated by Brett Smith almost 9 years ago

  • Category set to Crunch
  • Target version changed from Bug Triage to 2015-06-10 sprint
  • Story points set to 1.0
  • Project changed from 35 to Arvados
Actions #6

Updated by Brett Smith almost 9 years ago

  • Assigned To set to Brett Smith
Actions #7

Updated by Brett Smith almost 9 years ago

I created a tar archived that combines the requested Arvados SDK version and script version. That's 15144960 bytes. I ran it through just the tar extraction code of crunch-job:

#!/usr/bin/perl

if (!open(TARX, "|-", "tar", "-x")) {
  die "Error launching 'tar -xC $destdir': $!";
}
# If we send too much data to tar in one write (> 4-5 MiB), it stops, and we
# get SIGPIPE.  We must feed it data incrementally.
my $tar_input;
while (read(DATA, $tar_input, 65536)) {
  print TARX $tar_input;
}
if(!close(TARX)) {
  die "'tar -xC $destdir' exited $?: $!";
}

__DATA__

That worked. So right now I suspect an error passing the data over the wire via SLURM to the compute nodes.

Actions #8

Updated by Brett Smith almost 9 years ago

  • Status changed from New to In Progress
Actions #9

Updated by Tom Clegg almost 9 years ago

This doesn't show the warning message "srun: error: compute32: task 0: Exited with exit code 141" but it does seem pretty close to what we're seeing:

9tee4:/tmp# n=10000000; ((set -o pipefail; echo 'exit 0;'; echo __DATA__; head -c$n /dev/urandom) | sudo -u crunch srun -n1 sh -c 'sleep 1; perl -'); echo $?
141

This command exits 0 with small n, but exits 141 when n is bigger than about 4000000.

Our install script exits without reading its __DATA__ section when this condition gets hit:

### Installation mode
open L, ">", "$destdir.lock" or die "$destdir.lock: $!";
flock L, LOCK_EX;
if (readlink ("$destdir.commit") eq $commit && -d $destdir) {
  # This version already installed -> nothing to do.
  exit(0);
}
Actions #10

Updated by Brett Smith almost 9 years ago

  • Assigned To changed from Brett Smith to Tom Clegg

I'm satisfied with that explanation, and I'm satisfied with fbcc5ab if it passes testing. Thanks.

Actions #11

Updated by Tom Clegg almost 9 years ago

6094-install-script-sigpipe @ b4c2a96 ensures the DATA section gets consumed even in that "don't need to extract" case, but I'm now less convinced this actually explains the failures.

Currently focusing on propagating the install script's stderr to the job log.

Actions #12

Updated by Brett Smith almost 9 years ago

Reviewing e010ab73 (literally, just that commit)

It looks like you intend to use the recipe in select's perldoc to select specific filehandles for set periods of time. However, this commit does not assign 1 to vec's lvalue, which seems critical to making the recipe work. Is that an oversight? Or is this code intentionally doing something else? Right now it reads to me as:

my $bits = '';
# No-op: vec() returns a value, which we do nothing with.
# It has no other side effects.
vec($bits, fileno($install_stderr_r), 1);
# select('', undef, '', 0.1) - it's not clear what that will do, but I'm
# guessing it's a lot like select(undef, undef, undef, 0.1), a.k.a. sleep(0.1).
select ($bits, undef, $bits, 0.1);

If that's not right, well, I guess I'm just saying this code is awfully subtle.

Out of curiosity, did you test this anywhere? I'd be happy to kick the tires on my Docker cluster when I get to the office if that's helpful.

Thanks.

Actions #13

Updated by Tom Clegg almost 9 years ago

Brett Smith wrote:

this commit does not assign 1 to vec's lvalue

Whoops, yes, that was an oversight. Amended (now at c96f21f) by using the handy fhbits() sub that was sitting around unused.

Out of curiosity, did you test this anywhere? I'd be happy to kick the tires on my Docker cluster when I get to the office if that's helpful.

Tested on staging (http://curover.se/4xphq-8i9sb-zon7v8b0846mcwq), just by sticking "warn 'hola';" in the install script.

Looks like this:

2015-05-22_21:38:07 4xphq-8i9sb-zon7v8b0846mcwq 17571  Run install script on all workers
2015-05-22_21:38:07 4xphq-8i9sb-zon7v8b0846mcwq 17571  stderr starting: ['srun','--nodelist=compute1','-D','/tmp','--job-name=4xphq-8i9sb-zon7v8b0846mcwq','sh','-c','mkdir -p /tmp/crunch-job/opt && cd /tmp/crunch-job && perl -']
2015-05-22_21:38:07 4xphq-8i9sb-zon7v8b0846mcwq 17571  stderr hola at - line 129.
2015-05-22_21:38:07 4xphq-8i9sb-zon7v8b0846mcwq 17571  Install script exited 0
Actions #14

Updated by Brett Smith almost 9 years ago

Tom Clegg wrote:

Brett Smith wrote:

this commit does not assign 1 to vec's lvalue

Whoops, yes, that was an oversight. Amended (now at c96f21f) by using the handy fhbits() sub that was sitting around unused.

c96f21fc is good to merge. Thanks.

Actions #15

Updated by Brett Smith almost 9 years ago

  • Status changed from In Progress to Resolved

Really fixed in 178d3f36.

Actions

Also available in: Atom PDF