Bug #11590

job reuse / re-use logic does not log decisions

Added by Joshua Randall 3 months ago. Updated 2 months ago.

Status:ResolvedStart date:04/28/2017
Priority:NormalDue date:
Assignee:Tom Clegg% Done:

100%

Category:Crunch
Target version:2017-05-24 sprint
Story points1.0Remaining (hours)0.00 hour
Velocity based estimate-

Description

It would be helpful if job reuse logic produced some log output to help understand why job output was or was not reused.

For example, we might have entries such as:

"no candidates for job reuse"
"no candidates for job reuse after filtering"
"job reuse candidates had inconsistent output, running a new job"
"job reuse candidate found, using existing output from xxxxx-xxxxx-xxxxxxxxxxx"

Ideally this might be included in crunch-dispatch logs (or even in some cases in job logs?), but as the decisions seem to be made by the API server, they could at least be logged there (i.e. in production.log).


Subtasks

Task #11728: Review 11590-log-reuseResolvedTom Clegg

Task #11683: Add logging to container reuse codeResolvedTom Clegg

Task #11663: ReviewResolvedTom Clegg

Associated revisions

Revision ed56c267
Added by Tom Clegg 3 months ago

Merge branch '11590-log-reuse'

refs #11590

Revision df55660f
Added by Tom Clegg 2 months ago

Merge branch '11590-log-reuse'

refs #11590

Revision a9b15a2f
Added by Tom Clegg 2 months ago

Merge branch '11590-log-reuse'

closes #11590

History

#1 Updated by Tom Morris 3 months ago

  • Target version set to 2017-05-24 sprint

#2 Updated by Tom Clegg 3 months ago

For crunch1 and crunch2: add a config flag "log_reuse_decisions". If true, in the "find reusable" code, build the ActiveRelation one where() at a time, and log a count() after each one.

For crunch1 only: add a "reuse_job_if_outputs_differ" flag, default false. If true, return "chosen" as soon as it's chosen, instead of waiting for a conflicting-output job.

#3 Updated by Tom Clegg 3 months ago

  • Story points set to 1.0

#4 Updated by Tom Clegg 3 months ago

  • Assignee set to Tom Clegg

#5 Updated by Tom Clegg 3 months ago

  • Status changed from New to In Progress

Example logs

find_reusable: have 29 readable jobs in a reusable state
find_reusable: have 2 candidates after filtering on script_parameters_digest 03a43a7d84f7fb022467b876c2950acd
find_reusable: have 2 candidates after filtering on !nondeterministic
find_reusable: have 2 candidates after filtering on repo, script, and custom filters [["repository", "=", "active/foo"], ["script", "=", "hash"], ["docker_image_locator", "=", nil]]
find_reusable: job zzzzz-8i9sb-aceg2bnq7jt7kon output fa7aeb5140e2848d39b416daeef4ffc5+45 can be reused; continuing search in case others disagree
find_reusable: job zzzzz-8i9sb-nj8ioxnrvjtyk2b output aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa+1 disagrees; forgetting about zzzzz-8i9sb-aceg2bnq7jt7kon and ignoring any other finished jobs (see reuse_job_if_outputs_differ in application.default.yml)
find_reusable: done, nothing suitable

#6 Updated by Tom Clegg 3 months ago

  # Include details about job reuse decisions in the server log. This
  # causes additional database queries to run, so it should not be
  # enabled unless you expect to examine the resulting logs for
  # troubleshooting purposes.
  log_reuse_decisions: false

  # Control job reuse behavior when two completed jobs match the
  # search criteria and have different outputs.
  #
  # If true, in case of a conflict, reuse the earliest job (this is
  # similar to container reuse behavior).
  #
  # If false, in case of a conflict, do not reuse any completed job,
  # but do reuse an already-running job if available (this is the
  # original job reuse behavior, and is still the default).
  reuse_job_if_outputs_differ: false

#8 Updated by Radhika Chippada 3 months ago

  • reuselog => log_reuse_decision or log_reuse_plan?
  • The following seems like it could be an extremely confusing log message. This is logging the total number of jobs that are Complete or Queued or Running (which could be very large number) without any consideration at all of any of the search criteria given, of which all of them can be completely unrelated. It doesn't seem like we should be logging this. Otherwise, please consider rewording something like "there are 'n' number of jobs readable that are either complete or queued or running, and will check if any of them can be candidates for reuse ..."
    candidates = Job.readable_by(current_user).where(
      'state = ? or (owner_uuid = ? and state in (?))',
      Job::Complete, current_user.uuid, [Job::Queued, Job::Running])
    reuselog { "have #{candidates.count} readable jobs in a reusable state" }
  • "job #{j.uuid} is reusable, but unfinished, so not better than #{incomplete_job.uuid}; ignoring" => another unfinished and reusable job #{j.uuid}, but not better than ... ?
  • "job #{j.uuid} output #{j.output} ignored, see above" => "job #{j.uuid} WITH output ..."
  • "job #{j.uuid} output #{j.output} disagrees ... " => "job #{j.uuid} WITH output ..."
  • " ... continuing search in case others disagree" => continuing search in case conflicting jobs and outputs exist?
  • "job #{j.uuid} output #{j.output} unavailable to user; now refusing to reuse any finished job" => this a little confusing. should we say due to the Rails.configuration.reuse_job_if_outputs_differ setting, we will not be reusing any of the completed jobs?
  • "done, #{j.uuid} was selected" and "done, nothing suitable" => can be a bit more verbose. something like " done searching for a reusable job and found / not fount ..."
  • Rails.configuration.reuse_job_if_outputs_differ => reuse_job_EVEN_WHEN_outputs_differ ?

#9 Updated by Tom Clegg 3 months ago

Updated log messages:

find_reusable: starting with 34 jobs readable by current user zzzzz-tpzed-xurymjxw79nv3jz
find_reusable: have 29 candidates after filtering on job state (either state=Complete, or state=Queued/Running and submitted by current user)
find_reusable: have 4 candidates after filtering on script_parameters_digest a5f03bbfb8ba88a2efe4a7852671605b
find_reusable: have 3 candidates after filtering on !nondeterministic
find_reusable: have 1 candidates after filtering on repo, script, and custom filters [["repository", "=", "active/foo"], ["script", "=", "hash"], ["script_version", "=", "4fe459abe02d9b365932b8f5dc419439ab4e2577"], ["docker_image_locator", "=", nil]]
find_reusable: job zzzzz-8i9sb-cjs4pklxxjykqqq with output ea10d51bcf88862dbcc36eb292017dfd+45 can be reused; continuing search in case other candidates have different outputs
find_reusable: done, zzzzz-8i9sb-cjs4pklxxjykqqq was selected
find_reusable: starting with 34 jobs readable by current user zzzzz-tpzed-xurymjxw79nv3jz
find_reusable: have 29 candidates after filtering on job state (either state=Complete, or state=Queued/Running and submitted by current user)
find_reusable: have 3 candidates after filtering on script_parameters_digest 174dd339d44f2b259fadbab7ebdb8df9
find_reusable: have 3 candidates after filtering on !nondeterministic
find_reusable: have 2 candidates after filtering on repo, script, and custom filters [["repository", "=", "active/foo"], ["script", "=", "hash"], ["script_version", "=", "4fe459abe02d9b365932b8f5dc419439ab4e2577"], ["docker_image_locator", "=", nil]]
find_reusable: job zzzzz-8i9sb-ahd7cie8jah9qui with output ea10d51bcf88862dbcc36eb292017dfd+45 can be reused; continuing search in case other candidates have different outputs
find_reusable: job zzzzz-8i9sb-cjs4pklxxjykppp output  disagrees; forgetting about zzzzz-8i9sb-ahd7cie8jah9qui and ignoring any other finished jobs (see reuse_job_if_outputs_differ in application.default.yml)
find_reusable: done, nothing suitable

#10 Updated by Radhika Chippada 2 months ago

have 29 candidates after filtering on job state (either state=Complete, or state=Queued/Running and submitted by current user)

I am not sure if it is clear enough that the "submitted by current user" is applicable only to state=Queued/Running or state=Complete as well, without reading it multiple times. Can we say (either state=Complete or submitted by current user in state=Queued/Running to be clear?

LGTM

#11 Updated by Tom Clegg 2 months ago

updated to

"have #{candidates.count} candidates after filtering on job state ((state=Complete) or (state=Queued/Running and (submitted by current user)))" 

#12 Updated by Tom Clegg 2 months ago

Container reuse logging samples

find_reusable: starting with 14 container records in database
find_reusable: have 14 candidates after filtering on command ["echo", "hello"]
find_reusable: have 14 candidates after filtering on cwd "test" 
find_reusable: have 2 candidates after filtering on environment {"var"=>"complete"}
find_reusable: have 2 candidates after filtering on output_path "test" 
find_reusable: have 2 candidates after filtering on container_image "9ae44d5792468c58bcf85ce7353c7027+124" 
find_reusable: have 2 candidates after filtering on mounts {"test"=>{"kind"=>"json"}}
find_reusable: have 2 candidates after filtering on runtime_constraints {"keep_cache_ram"=>268435456, "ram"=>12000000000, "vcpus"=>4}
find_reusable: have 2 with state=Complete, exit_code=0
find_reusable: have 2 with log readable by current user zzzzz-tpzed-000000000000000
find_reusable: have 2 with output readable by current user zzzzz-tpzed-000000000000000
find_reusable: done, reusing completed container zzzzz-dz642-cwpvv44a5ys2pkj

find_reusable: starting with 14 container records in database
find_reusable: have 14 candidates after filtering on command ["echo", "hello"]
find_reusable: have 14 candidates after filtering on cwd "test" 
find_reusable: have 2 candidates after filtering on environment {"var"=>"failed_vs_running"}
find_reusable: have 2 candidates after filtering on output_path "test" 
find_reusable: have 2 candidates after filtering on container_image "9ae44d5792468c58bcf85ce7353c7027+124" 
find_reusable: have 2 candidates after filtering on mounts {"test"=>{"kind"=>"json"}}
find_reusable: have 2 candidates after filtering on runtime_constraints {"ram"=>12000000000, "vcpus"=>4}
find_reusable: have 0 with state=Complete, exit_code=0
find_reusable: have 0 with log readable by current user zzzzz-tpzed-000000000000000
find_reusable: have 0 with output readable by current user zzzzz-tpzed-000000000000000
find_reusable: done, reusing container zzzzz-dz642-dl6lxlw5l3m61z3 with state=Running

find_reusable: starting with 12 container records in database
find_reusable: have 0 candidates after filtering on command ["echo", "foo"]
find_reusable: have 0 candidates after filtering on cwd "." 
find_reusable: have 0 candidates after filtering on environment {}
find_reusable: have 0 candidates after filtering on output_path "/tmp" 
find_reusable: have 0 candidates after filtering on container_image "fa3c1a9cb6783f85f2ecda037e07b8c3+167" 
find_reusable: have 0 candidates after filtering on mounts {}
find_reusable: have 0 candidates after filtering on runtime_constraints {"keep_cache_ram"=>268435456, "vcpus"=>1, "ram"=>1}
find_reusable: have 0 with state=Complete, exit_code=0
find_reusable: have 0 with log readable by current user zzzzz-tpzed-000000000000000
find_reusable: have 0 with output readable by current user zzzzz-tpzed-000000000000000
find_reusable: have no containers in Running state
find_reusable: have no containers in Locked or Queued state
find_reusable: done, no reusable container found

11590-log-reuse @ d95ccc6bbbf94e957b939ca9d9aea71cda345944

#13 Updated by Radhika Chippada 2 months ago

  • Do we want to log the entire command and mounts? Would the log messages become too big and unreadable? Can we instead print first few lines of these attributes?
  • “find_reusable: have 2 with log readable by current user zzzzz-tpzed-000000000000000” => here is the current user going to be an actual end user (not the system user)?
  • "have #{candidates.count} candidates after filtering on job state ((state=Complete) or (state=Queued/Running and (submitted by current user)))" => "have #{candidates.count} candidates after filtering on job state ((state = Complete) or (state = Queued/Running that are submitted by #{current_user.uuid} (submitted by current user) ))” might be easier to read and understand

LGTM

#14 Updated by Tom Clegg 2 months ago

Radhika Chippada wrote:

  • Do we want to log the entire command and mounts? Would the log messages become too big and unreadable? Can we instead print first few lines of these attributes?

Punting on this -- this logging is only enabled while debugging, so limiting detail might do as much harm as good.

  • “find_reusable: have 2 with log readable by current user zzzzz-tpzed-000000000000000” => here is the current user going to be an actual end user (not the system user)?

Updated this message to reflect the fact that permissions don't come into it (current_user is always system_user at this point).

#15 Updated by Tom Clegg 2 months ago

  • Status changed from In Progress to Resolved
  • % Done changed from 67 to 100

Applied in changeset arvados|commit:a9b15a2f041744c8fac8c0b12fa8b61d1a7292f8.

Also available in: Atom PDF