Bug #5026

[API] Prohibit leaving jobs in an inconsistent state (e.g., state=queued with non-null log or output)

Added by Ward Vandewege over 4 years ago. Updated over 4 years ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
API
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:
Story points:
1.0

Description

qr1hi-8i9sb-lxlj7ut2ynfk4nj is a run-command job. Workbench has some very weird start/stop information about this job:

created_at: 12:08 PM 1/19/2015
started_at: 12:08 PM 1/19/2015
finished_at: 11:37 AM 12/30/2014

The logs that show for this job are from the run that finished on 12/30/2014.

The crunch-dispatch --jobs logs show that it keeps trying to start the job, but finds it locked and then keeps retrying:

2015-01-20_14:34:52.79332 11901 API method jobs/lock failed: API call /jobs/qr1hi-8i9sb-lxlj7ut2ynfk4nj/lock failed: 403 Forbidden {"errors":["#<ArvadosModel::AlreadyLockedError: ArvadosModel::AlreadyLockedError>"],"error_token":"1421764492+f7dffd85"}. Retrying at 2015-01-20T14:34:56Z.

History

#1 Updated by Ward Vandewege over 4 years ago

  • Subject changed from [Crunch] crunch-dispatch keeps retrying a locked job to [Crunch] crunch-dispatch appears to be trying to re-run an old job
  • Description updated (diff)

#2 Updated by Brett Smith over 4 years ago

Note that the lock failure log message is coming from crunch-job, not from crunch-dispatch. crunch-dispatch sees the job in the queue, hands it to crunch-job. crunch-job reports back, "I couldn't work on this because I couldn't lock it." crunch-dispatch assumes that means that some other dispatcher got to the job first, so it leaves the record alone in the queue. It refreshes the queue, sees the job still there, tries again, and the loop repeats.

I think the fundamental question here is, how did the job get back on the queue after finishing on the 30th?

#3 Updated by Brett Smith over 4 years ago

  • Category set to Crunch

#4 Updated by Brett Smith over 4 years ago

qr1hi-57u5n-m61q51j3hx5kgzv is the first log entry for the job, on 19 January. It shows that the job was originally created in this odd state: created on 19 January, finished on 30 December, already locked by qr1hi-tpzed-ohrz0hejlvl9g1n, with a log and output set. But it's in Queued state. Right now my top guess is that something went odd with job reuse.

#5 Updated by Brett Smith over 4 years ago

twp: short answer: I ran that job almost certainly by copying out the job definition from a previous job of Bryan's, editing the fields by hand, and running arv-crunch-job
twp: very probably without editing out all of the fields and assuming (hah!) that crunch-job would DTRT with any that it didn't need

We should probably make the Job model/controller stricter somehow, although exactly how is still up for discussion.

I just canceled the job to get it out of the queue and relieve the immediate pain. If you want to see how it looked originally, refer to the log mentioned in my previous comment.

#6 Updated by Brett Smith over 4 years ago

  • Subject changed from [Crunch] crunch-dispatch appears to be trying to re-run an old job to [API] Permits creating+queuing jobs that already have log+output, confusing Crunch
  • Category changed from Crunch to API

#7 Updated by Tom Clegg over 4 years ago

Possible solution:
  • Certain job attributes (output, log) should only be allowed to change (including from null to non-null) when the job is in Running state.
  • Certain job attributes (finished_at) should only be changed while state is changing (from Running to Complete/Failed)

#8 Updated by Tom Clegg over 4 years ago

  • Story points set to 1.0

#9 Updated by Tom Clegg over 4 years ago

  • Subject changed from [API] Permits creating+queuing jobs that already have log+output, confusing Crunch to [API] Prohibit leaving jobs in an inconsistent state (e.g., state=queued with non-null log or output)

#10 Updated by Tom Clegg over 4 years ago

  • Target version changed from Bug Triage to Arvados Future Sprints

Also available in: Atom PDF