Bug #6094

[Crunch] Install script exited 141

Added by Peter Amstutz over 4 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Start date:
05/22/2015
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0

Description

Pipeline qr1hi-d1hrv-yer77lzhezepsui

2015-05-20_12:15:57 salloc: Granted job allocation 2617 2015-05-20_12:15:58 qr1hi-8i9sb-ltiwtsu42bkytce 17131 running from /usr/local/arvados/src/sdk/cli/bin/crunch-job with arvados-cli Gem version(s) 0.1.20150205181653, 0.1.20150128223752, 0.1.20150121183928, 0.1.20141209151444, 0.1.20141014201516, 0.1.20140919104705, 0.1.20140905165259, 0.1.20140827170424, 0.1.20140825141611, 0.1.20140812162850, 0.1.20140708213257, 0.1.20140707162447, 0.1.20140630151639, 0.1.20140513131358, 0.1.20140513101345, 0.1.20140414145041 
2015-05-20_12:15:58 qr1hi-8i9sb-ltiwtsu42bkytce 17131 check slurm allocation 
2015-05-20_12:15:58 qr1hi-8i9sb-ltiwtsu42bkytce 17131 node compute32 - 1 slots 
2015-05-20_12:15:58 qr1hi-8i9sb-ltiwtsu42bkytce 17131 start 
2015-05-20_12:15:58 qr1hi-8i9sb-ltiwtsu42bkytce 17131 Clean work dirs 
2015-05-20_12:15:58 starting: ['srun','--nodelist=compute32','-D','/tmp','bash','-ec','mount -t fuse,fuse.keep | awk \'($3 ~ /\\ykeep\\y/){print $3}\' | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid'] 
2015-05-20_12:16:00 qr1hi-8i9sb-ltiwtsu42bkytce 17131 Cleanup command exited 0 
2015-05-20_12:16:00 starting: ['srun','--nodelist=compute32','/bin/sh','-ec',' if ! /usr/bin/docker.io images -q --no-trunc --all | grep -qxF b85dffb1be2ca7bc757be6ff8ae4873a45214918282ef42cc2cbc2cead63356b; then arv-get 877b0bb063029e309ec3d0624e75eeda\\+503\\/b85dffb1be2ca7bc757be6ff8ae4873a45214918282ef42cc2cbc2cead63356b\\.tar | /usr/bin/docker.io load fi '] 
2015-05-20_12:16:00 starting: ['srun','--nodelist=compute32','/bin/sh','-ec','/usr/bin/docker.io run --help | grep -qe --memory-swap='] 
2015-05-20_12:16:00 qr1hi-8i9sb-ltiwtsu42bkytce 17131 Packing Arvados SDK version 29009a1c1f8a9653042c5853832881aca4141cf2 for installation 
2015-05-20_12:16:00 qr1hi-8i9sb-ltiwtsu42bkytce 17131 Looking for version c59093180fd92f0d7c6607a49458446212ebd058 from repository arvados 
2015-05-20_12:16:00 qr1hi-8i9sb-ltiwtsu42bkytce 17131 Using local repository '/var/lib/arvados/internal.git' 
2015-05-20_12:16:00 qr1hi-8i9sb-ltiwtsu42bkytce 17131 Version c59093180fd92f0d7c6607a49458446212ebd058 is commit c59093180fd92f0d7c6607a49458446212ebd058 
2015-05-20_12:16:01 qr1hi-8i9sb-ltiwtsu42bkytce 17131 Run install script on all workers 
2015-05-20_12:16:01 starting: ['srun','--nodelist=compute32','-D','/tmp','--job-name=qr1hi-8i9sb-ltiwtsu42bkytce','sh','-c','mkdir -p /tmp/crunch-job/opt && cd /tmp/crunch-job && perl -'] 
2015-05-20_12:16:03 srun: error: compute32: task 0: Exited with exit code 141 
2015-05-20_12:16:03 qr1hi-8i9sb-ltiwtsu42bkytce 17131 Install script exited 141 
2015-05-20_12:16:03 salloc: Relinquishing job allocation 2617 
2015-05-20_12:16:03 Collection saved as 'qr1hi-8i9sb-ltiwtsu42bkytce.log.txt' 
2015-05-20_12:16:03 close failed in file object destructor: 
2015-05-20_12:16:03 sys.excepthook is missing 
2015-05-20_12:16:03 lost sys.stderr

Subtasks

Task #6111: Review 6094-install-script-sigpipeResolvedTom Clegg

Task #6144: Review 6094-install-script-sigpipe @ e010ab7ResolvedTom Clegg

Associated revisions

Revision 0f125dd5
Added by Tom Clegg over 4 years ago

Merge branch '6094-install-script-sigpipe' (early part) refs #6094

Revision 9b67268f
Added by Tom Clegg over 4 years ago

Merge branch '6094-install-script-sigpipe' refs #6094

Revision 9413eb73
Added by Tom Clegg over 4 years ago

Merge branch '6146-ignore-tar-sigpipe' refs #6146 refs #6094

Revision 512cc931 (diff)
Added by Tom Clegg over 4 years ago

Use --blocking-factor 1 with tar -A to reduce excess trailing padding. refs #6146 refs #6094

Revision 178d3f36 (diff)
Added by Tom Clegg over 4 years ago

Tell tar to read to EOF (even if it detects trailing NULs).

Avoids SIGPIPE when feeding a tarball made with tar -A.

refs #6146 refs #6094

History

#1 Updated by Peter Amstutz over 4 years ago

  • Description updated (diff)

#2 Updated by Peter Amstutz over 4 years ago

  • Description updated (diff)

#3 Updated by Bryan Cosca over 4 years ago

Another instance with this error: qr1hi-d1hrv-96bdqhdlhgopth1

#5 Updated by Brett Smith over 4 years ago

  • Category set to Crunch
  • Target version changed from Bug Triage to 2015-06-10 sprint
  • Story points set to 1.0
  • Project changed from Arvados Private to Arvados

#6 Updated by Brett Smith over 4 years ago

  • Assigned To set to Brett Smith

#7 Updated by Brett Smith over 4 years ago

I created a tar archived that combines the requested Arvados SDK version and script version. That's 15144960 bytes. I ran it through just the tar extraction code of crunch-job:

#!/usr/bin/perl

if (!open(TARX, "|-", "tar", "-x")) {
  die "Error launching 'tar -xC $destdir': $!";
}
# If we send too much data to tar in one write (> 4-5 MiB), it stops, and we
# get SIGPIPE.  We must feed it data incrementally.
my $tar_input;
while (read(DATA, $tar_input, 65536)) {
  print TARX $tar_input;
}
if(!close(TARX)) {
  die "'tar -xC $destdir' exited $?: $!";
}

__DATA__

That worked. So right now I suspect an error passing the data over the wire via SLURM to the compute nodes.

#8 Updated by Brett Smith over 4 years ago

  • Status changed from New to In Progress

#9 Updated by Tom Clegg over 4 years ago

This doesn't show the warning message "srun: error: compute32: task 0: Exited with exit code 141" but it does seem pretty close to what we're seeing:

9tee4:/tmp# n=10000000; ((set -o pipefail; echo 'exit 0;'; echo __DATA__; head -c$n /dev/urandom) | sudo -u crunch srun -n1 sh -c 'sleep 1; perl -'); echo $?
141

This command exits 0 with small n, but exits 141 when n is bigger than about 4000000.

Our install script exits without reading its __DATA__ section when this condition gets hit:

### Installation mode
open L, ">", "$destdir.lock" or die "$destdir.lock: $!";
flock L, LOCK_EX;
if (readlink ("$destdir.commit") eq $commit && -d $destdir) {
  # This version already installed -> nothing to do.
  exit(0);
}

#10 Updated by Brett Smith over 4 years ago

  • Assigned To changed from Brett Smith to Tom Clegg

I'm satisfied with that explanation, and I'm satisfied with fbcc5ab if it passes testing. Thanks.

#11 Updated by Tom Clegg over 4 years ago

6094-install-script-sigpipe @ b4c2a96 ensures the DATA section gets consumed even in that "don't need to extract" case, but I'm now less convinced this actually explains the failures.

Currently focusing on propagating the install script's stderr to the job log.

#12 Updated by Brett Smith over 4 years ago

Reviewing e010ab73 (literally, just that commit)

It looks like you intend to use the recipe in select's perldoc to select specific filehandles for set periods of time. However, this commit does not assign 1 to vec's lvalue, which seems critical to making the recipe work. Is that an oversight? Or is this code intentionally doing something else? Right now it reads to me as:

my $bits = '';
# No-op: vec() returns a value, which we do nothing with.
# It has no other side effects.
vec($bits, fileno($install_stderr_r), 1);
# select('', undef, '', 0.1) - it's not clear what that will do, but I'm
# guessing it's a lot like select(undef, undef, undef, 0.1), a.k.a. sleep(0.1).
select ($bits, undef, $bits, 0.1);

If that's not right, well, I guess I'm just saying this code is awfully subtle.

Out of curiosity, did you test this anywhere? I'd be happy to kick the tires on my Docker cluster when I get to the office if that's helpful.

Thanks.

#13 Updated by Tom Clegg over 4 years ago

Brett Smith wrote:

this commit does not assign 1 to vec's lvalue

Whoops, yes, that was an oversight. Amended (now at c96f21f) by using the handy fhbits() sub that was sitting around unused.

Out of curiosity, did you test this anywhere? I'd be happy to kick the tires on my Docker cluster when I get to the office if that's helpful.

Tested on staging (http://curover.se/4xphq-8i9sb-zon7v8b0846mcwq), just by sticking "warn 'hola';" in the install script.

Looks like this:

2015-05-22_21:38:07 4xphq-8i9sb-zon7v8b0846mcwq 17571  Run install script on all workers
2015-05-22_21:38:07 4xphq-8i9sb-zon7v8b0846mcwq 17571  stderr starting: ['srun','--nodelist=compute1','-D','/tmp','--job-name=4xphq-8i9sb-zon7v8b0846mcwq','sh','-c','mkdir -p /tmp/crunch-job/opt && cd /tmp/crunch-job && perl -']
2015-05-22_21:38:07 4xphq-8i9sb-zon7v8b0846mcwq 17571  stderr hola at - line 129.
2015-05-22_21:38:07 4xphq-8i9sb-zon7v8b0846mcwq 17571  Install script exited 0

#14 Updated by Brett Smith over 4 years ago

Tom Clegg wrote:

Brett Smith wrote:

this commit does not assign 1 to vec's lvalue

Whoops, yes, that was an oversight. Amended (now at c96f21f) by using the handy fhbits() sub that was sitting around unused.

c96f21fc is good to merge. Thanks.

#15 Updated by Brett Smith over 4 years ago

  • Status changed from In Progress to Resolved

Really fixed in 178d3f36.

Also available in: Atom PDF