Bug #3113

Some crunch tasks miss some input files and have others duplicated

Added by Abram Connelly about 5 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Brett Smith
Category:
Crunch
Target version:
Start date:
07/09/2014
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0

Description

When running pipeline instance qr1hi-d1hrv-qh1k1mg3qqokv1s on collection 0938aa40406992ee7c02666b2708fbce+73567 , some input files do not get processed and others are duplicated.

The program from the above pipeline consists of a single job, calling 'one_task_per_input_file' on the collection:

#!/usr/bin/env python
#
# Simple program to expose dropped input files and duplicated input files.
#

import arvados
import os
import sys
import subprocess as sp

arvados.job_setup.one_task_per_input_file( if_sequence=0, and_end_task=True, input_as_path=True )

this_job = arvados.current_job()
this_task = arvados.current_task()
this_task_input = this_task['parameters']['input']

work_dir  = os.environ['CRUNCH_SRC']
mount_dir = os.environ['TASK_KEEPMOUNT']

input_filename = arvados.get_task_param_mount('input')

out_dir = os.path.join( arvados.current_task().tmpdir, "output" )
os.mkdir( out_dir )

out_fn = os.path.join( out_dir, "dummyfile" )
print "INPUTFILE:", input_filename

dummyExec = os.path.join( work_dir, "crunch_scripts/multipleFileBug/writedummy.sh" )
pOut = sp.check_output( [ dummyExec, "dummytext", out_fn ] )

out = arvados.CollectionWriter()
out.write_directory_tree( out_dir, max_manifest_depth=0 )
this_task.set_output( out.finish() )
sys.exit(0)

In the output log 'd3bee5816a311e5cee846e06ef2b97ec+89' should produce 863 unique file paths on the 'INPUTFILE:' lines, instead it porduces 860. The input file 'chr8_band2_s6200000_e12700000.bedGraph' appears four times when it should only appear once. The files 'chr11_band4_s16200000_e21700000.bedGraph', 'chr13_band6_s23300000_e25500000.bedGraph' and 'chr18_band0_s0_e2900000.bedGraph' should appear but are missed.


Subtasks

Task #3218: Review 3113-qsequence-serialResolvedBrett Smith

Associated revisions

Revision 1b4adbdc
Added by Brett Smith about 5 years ago

Merge branch '3113-qsequence-serial'

Refs #3113. This is a useful bugfix, but we need to test whether
this fully resolves the issue.

Revision 56303d3c (diff)
Added by Brett Smith about 5 years ago

3113: Make job task test more robust.

Per Tom's feedback in refs #3113.
I should've made this change before I merged; apologies.

History

#1 Updated by Tom Clegg about 5 years ago

  • Story points set to 1.0

#2 Updated by Tom Clegg about 5 years ago

  • Target version set to 2014-07-16 Sprint

#3 Updated by Tim Pierce about 5 years ago

  • Assigned To set to Tim Pierce

#4 Updated by Tim Pierce about 5 years ago

I haven't been able to reproduce this issue. I cloned git@git.qr1hi.arvadosapi.com:abram.git and copied the multipleFileBug directory into my 4xphq repository, and ran pipeline https://workbench.4xphq.arvadosapi.com/pipeline_instances/4xphq-d1hrv-022dunoh2pt5sdd. The pipeline completed successfully, but generated the empty collection and did not produce a diagnostic log. According to the job API response, the expected 864 tasks were run successfully.

#5 Updated by Tim Pierce about 5 years ago

  • Category set to Crunch
  • Status changed from New to Rejected
  • Target version changed from 2014-07-16 Sprint to Deferred

Rejecting this bug -- let's revisit it if we can reproduce the problem reliably.

#6 Updated by Tim Pierce about 5 years ago

  • Status changed from Rejected to New
  • Target version changed from Deferred to 2014-08-06 Sprint

Returning this to "new" -- there are other bugs (#3212 at least) preventing me from adequately determining whether this can be reproduced. Will attempt to reproduce again after that has been resolved.

#7 Updated by Brett Smith about 5 years ago

It looks like the SDK set up the tasks correctly:

>>> tasks = api.job_tasks().list(filters=[['job_uuid', '=', 'qr1hi-8i9sb-j0gyutpct11dcnb'], ['sequence', '=', '1']], limit=1000).execute()
>>> len(tasks['items'])
863
>>> inputs = set(t['parameters']['input'] for t in tasks['items'])
>>> len(inputs)
863
>>> any(inp for inp in inputs if inp.endswith('chr13_band6_s23300000_e25500000.bedGraph'))
True
>>> any(inp for inp in inputs if inp.endswith('chr11_band4_s16200000_e21700000.bedGraph'))
True

#8 Updated by Brett Smith about 5 years ago

Here's the problem:

>>> set(t['qsequence'] for t in tasks['items'])
set([None])

crunch-job fetches tasks ordered by qsequence. If that's not set, then it could happen that when crunch-job fetches tasks from the API server, it will get some twice, and others not at all.

JobTask on the API server has a callback after_update :assign_created_job_tasks_qsequence_if_succeeded. It seems likely this is buggy and needs fixing.

#9 Updated by Brett Smith about 5 years ago

  • Assigned To changed from Tim Pierce to Brett Smith

#10 Updated by Tom Clegg about 5 years ago

  • Status changed from New to In Progress

#11 Updated by Tom Clegg about 5 years ago

  • Target version changed from 2014-08-06 Sprint to 2014-07-16 Sprint

#12 Updated by Tom Clegg about 5 years ago

In services/api/test/unit/job_task_test.rb I think it would be slightly safer to use create! instead of create here. Otherwise (I think) it's possible for object creation to be abandoned due to some filter/hook that runs earlier than set_default_qsequence, which would let the test pass silently without really checking anything:

task = JobTask.create(qsequence: 99)

That's the only complaint I can come up with.

(As I got to the end of the diff I was thinking, "all I can think to add is a test to establish that a client-supplied qsequence doesn't get overwritten with the default" ...and there it was right at the end of the diff. Thanks!)

#13 Updated by Tom Clegg about 5 years ago

  • Status changed from In Progress to Resolved

#14 Updated by Brett Smith about 5 years ago

Now that my bugfix has deployed to staging, I tested it out. I used the following crunch script, which just spawns a bunch of tasks (how many is specified as a job parameter) and has each one write out a line. With enough tasks spawned, duplicate/missing tasks should be noticeable.

#!/usr/bin/env python

import arvados

output = arvados.CollectionWriter()
output.set_current_file_name("log.txt")

task_count = int(arvados.getjobparam('task_count'))
task_num_fmt = '{{:0{}d}}'.format(len(str(task_count)))

def seq0(cur_task):
    task_body = {'job_uuid': arvados.current_job()['uuid'],
                 'created_by_job_task_uuid': cur_task['uuid'],
                 'sequence': 1,
                 'parameters': {'task_num': None}}
    make_task = arvados.api('v1').job_tasks().create
    for n in xrange(1, task_count + 1):
        task_body['parameters']['task_num'] = n
        make_task(body=task_body).execute()
    output.write("Origin task created {} tasks\n".format(task_count))

def seq1(cur_task):
    task_num_s = task_num_fmt.format(cur_task['parameters']['task_num'])
    output.write("Task {} has qsequence {:09d}\n".format(
            task_num_s, int(cur_task['qsequence'])))

_current_task = arvados.current_task()
_seq_func = locals()['seq{:d}'.format(int(_current_task['sequence']))]
_seq_func(_current_task)
_current_task.set_output(output.finish())

I used the following script to check the output and make sure that every task had one output line (and that their qsequence numbers increased monotonically):

#!/usr/bin/env python3

import fileinput
import sys

def numbers():
    global expected
    lines = fileinput.input()
    line1 = next(lines)
    expected = int(line1.split()[3])
    for line in lines:
        words = line.split()
        yield int(words[1]), int(words[4])

task_nums = set()
last_qseq = -1

for tnum, qseq in numbers():
    if tnum in task_nums:
        print("WARNING:", tnum, "duplicated")
    else:
        task_nums.add(tnum)
    if qseq <= last_qseq:
        print("WARNING: bad qseq step", last_qseq, qseq, "from", tnum)
    else:
        last_qseq = qseq

diff = set(range(1, expected + 1)).difference(task_nums)
if diff:
    print(diff)
    sys.exit(1)

Invoked to run ~1000 tasks, this job tickled the bug immediately on qr1hi (one task was run five times!), but three runs on staging passed with flying colors. With these results, I believe that the previous bugfix actually fixed the bug, so I'm closing this ticket.

Also available in: Atom PDF