Bug #5283

[Crunch] crunch-job apparently died while collating job's output

Added by Brett Smith almost 6 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Brett Smith
Category:
Crunch
Target version:
Start date:
02/19/2015
Due date:
% Done:

100%

Estimated time:
(Total: 1.00 h)
Story points:
1.0

Description

See qr1hi-8i9sb-ds95g67jm2jpmnd. The last line in the log says "collate", which means that we've entered create_output_collection. Normally, after that function is done, the main code body logs information about the result, including an error message if there is none, and then "finish" either way. None of that happened. Apparently crunch-job died during the collation step.

I checked the crunch-dispatch logs for anything interesting. Unfortunately, there's nothing you don't see in the log.


Subtasks

Task #5306: Review 5283-crunch-collation-safety-wipResolvedBrett Smith


Related issues

Related to Arvados - Support #5251: [Support] Fix bugs and write tests (first half)Resolved

Associated revisions

Revision 62fea2a4
Added by Brett Smith almost 6 years ago

Merge branch '5283-crunch-collation-safety-wip'

Closes #5283, #5306.

History

#1 Updated by Brett Smith almost 6 years ago

qr1hi-8i9sb-gnwpoyq1fgc2vky is a job that's almost identical, except the Crunch script explicitly sets its task's success flag to true before exiting. The earlier version left the flag undefined. crunch-job saved the new job's output successfully.

This might just be a coincidence, but this potentially suggests that there might be a bug from #4472.

#2 Updated by Brett Smith almost 6 years ago

  • Status changed from New to In Progress
  • Assigned To set to Brett Smith

#3 Updated by Brett Smith almost 6 years ago

  • Target version changed from Bug Triage to 2015-03-11 sprint

#4 Updated by Brett Smith almost 6 years ago

I created a small Perl script that fetched the failing task, and fed its output to a flattened version of create_output_collection (flattened to accept a single task as an argument, rather than iterating over @jobstep). This succeeded fine. Unless I messed something up in that test, this looks like some kind of transient issue rather than a code bug.

There are several TODOs in create_output_collection about handling errors from the collection creation process (a Python pipe). Right now I think I'd point to some issue there as the likeliest culprit.

#5 Updated by Brett Smith almost 6 years ago

qr1hi-8i9sb-mcla1dzm2zrpl0t is a rerun of the exact same original job, and it died the exact same way. My test must be buggy; there definitely seems to be a real bug here.

#6 Updated by Brett Smith almost 6 years ago

5283-crunch-collation-safety-wip is up for review. It probably won't fix the problem, but it makes a number of reliability improvements in this part of the code. Once it's staged on 4xphq, we could re-test this script and see if the additional logging can point us at the real problem.

One thing I noticed during testing is that the whole script dies very suddenly with SIGPIPE if the collection creation pipe (the one in create_output_collection) dies. It seems very possible there's some issue there.

#7 Updated by Brett Smith almost 6 years ago

This issue seems to be reproducible with any Crunch job that sets task output without setting success. I ran a very small test script on qr1hi that does just this, and Crunch also died during the collate step while running it. That will help test the problem more rapidly, at least.

#8 Updated by Peter Amstutz almost 6 years ago

  • It's supposed to skip the task if the output is empty (next unless exists $_->{'arvados_task'}->{'output'};) but tests to see if the output is valid using if (defined($next_write)). I just tried running crunch-job locally and had a task which failed with no output. It passed the first test (exists) but failed the second one (defined).
  • Since you are in this code, could you add a) log the size of each $next_write block and b) sum up and log the total size of the manifest that is sent to the API server? I suspect we would have spent much less time on #4956 (and its predecessor story) had we realized up front what we were actually dealing with.
  • Specifying a 64 MiB buffer to read a 40 byte portable data hash seems... excessive? (probably a holdover from something else)
    sysread($child_out, $joboutput, 64 * 1024 * 1024);
    

#9 Updated by Brett Smith almost 6 years ago

Peter Amstutz wrote:

  • It's supposed to skip the task if the output is empty (next unless exists $_->{'arvados_task'}->{'output'};) but tests to see if the output is valid using if (defined($next_write)).

Sorry, that's not correct. exists is to Ruby's has_key? as defined is to Ruby's nil?. The logic here matches what's in current master, and I believe it's intentional to enforce the rule that tasks must set output to be considered successful.

  • Since you are in this code, could you add a) log the size of each $next_write block and b) sum up and log the total size of the manifest that is sent to the API server?

We already log the output of each task. I added size information to that message, plus a log about the size of the final collated manifest.

  • Specifying a 64 MiB buffer to read a 40 byte portable data hash seems... excessive? (probably a holdover from something else)

Yeah, that was already here and I was leaving it alone, but I've adjusted it since you asked. I left it at 1MiB to leave plenty of room for future locator hints.

I also pushed another commit that stops setting please_freeze once the job fails permanently. Based on testing in my Docker environment, this fixes the actual bug: what was happening for the failure case is that after we exited the main loop, we stepped into freeze_if_want_freeze, which collates the output but doesn't actually save it to the job record. I don't believe we want this behavior anymore, so I stopped requesting the freeze.

Now at 538f4a8. Thanks.

#10 Updated by Peter Amstutz almost 6 years ago

Running a test pipeline:

4n8aq-8i9sb-v0obsiyl97duxlo 19167  collate
4n8aq-8i9sb-v0obsiyl97duxlo 19167  Error retrieving '' output by task 0 (4n8aq-ot0gb-wkfo7tizq8cbkvz)
4n8aq-8i9sb-v0obsiyl97duxlo 19167  collated output manifest is 105 bytes
4n8aq-8i9sb-v0obsiyl97duxlo 19167  output hash 9c4ad41a0f62aeafdf95f4df222a251b+54
4n8aq-8i9sb-v0obsiyl97duxlo 19167  finish
  1. Why is it saying "Error retrieving '' output by task 0 (4n8aq-ot0gb-wkfo7tizq8cbkvz)"?
  2. The collated output manifest is 105 bytes, but the output hash is "9c4ad41a0f62aeafdf95f4df222a251b+54" which is 54 bytes. Presumably this is because access tokens are stripped. Considering tweaking the message to say something to that effect.
  3. Could you change "output (%d bytes): %s" to say "task output" and the final "output hash" to say "job output" to disambiguate.

#11 Updated by Brett Smith almost 6 years ago

Peter Amstutz wrote:

Running a test pipeline:

[...]

  1. Why is it saying "Error retrieving '' output by task 0 (4n8aq-ot0gb-wkfo7tizq8cbkvz)"?

Well, what does your pipeline do? This is expected behavior if a task fails to set an output attribute. I have tried to clarify the log message here.

  1. The collated output manifest is 105 bytes, but the output hash is "9c4ad41a0f62aeafdf95f4df222a251b+54" which is 54 bytes. Presumably this is because access tokens are stripped. Considering tweaking the message to say something to that effect.
  2. Could you change "output (%d bytes): %s" to say "task output" and the final "output hash" to say "job output" to disambiguate.

Both done at a12247f.

#12 Updated by Peter Amstutz almost 6 years ago

Brett Smith wrote:

Well, what does your pipeline do? This is expected behavior if a task fails to set an output attribute. I have tried to clarify the log message here.

The exact same pipeline succeeds on master, so this is a regression. It uses one_task_per_input_file() from the SDK, which does this:

        if and_end_task:
            api('v1').job_tasks().update(uuid=current_task()['uuid'],
                                       body={'success':True}
                                       ).execute()
            exit(0)

I'm pretty sure a task with success == true has always been considered successful even if there is no output..

  1. The collated output manifest is 105 bytes, but the output hash is "9c4ad41a0f62aeafdf95f4df222a251b+54" which is 54 bytes. Presumably this is because access tokens are stripped. Considering tweaking the message to say something to that effect.
  2. Could you change "output (%d bytes): %s" to say "task output" and the final "output hash" to say "job output" to disambiguate.

Both done at a12247f.

"collated output manifest is $manifest_size bytes, including any access tokens" reads a little bit awkwardly. The other log messages look good.

#13 Updated by Brett Smith almost 6 years ago

I'll work on fixing the bug.

Peter Amstutz wrote:

"collated output manifest is $manifest_size bytes, including any access tokens" reads a little bit awkwardly. The other log messages look good.

Tell me exactly what wording you want and I'll implement that.

#14 Updated by Peter Amstutz almost 6 years ago

"collated output manifest text to send to API server is $manifest_size bytes with access tokens"

#15 Updated by Brett Smith almost 6 years ago

Peter Amstutz wrote:

Brett Smith wrote:

Well, what does your pipeline do? This is expected behavior if a task fails to set an output attribute. I have tried to clarify the log message here.

The exact same pipeline succeeds on master, so this is a regression.

Hahaha OK so what current master does is it gets $output = undef and then it just goes ahead and prints that to the $child_in pipe because, you know, what could possibly go wrong passing around undef to outside processes? use warnings is for the weak.

Fixed and your log change implemented at 253e285. Thanks.

#16 Updated by Peter Amstutz almost 6 years ago

a61b22a LGTM. Please merge.

#17 Updated by Brett Smith almost 6 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:62fea2a463eb0d8c3aef84a56ba84561a196666b.

#18 Updated by Brett Smith over 5 years ago

  • Story points set to 1.0

Also available in: Atom PDF