Bug #6598

closed

[Crunch] Fix crunch-job's update_progress_stats post-5717

Added by Brett Smith almost 9 years ago. Updated over 8 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Story points: 0.5

Description

crunch-job's update_progress_stats function updates the job's tasks summary. It assumes that the number of running jobs is the total number of SLURM slots available to the job, minus the number of slots unused (because they're free or being held due to node failures). After #5717, this math is no longer accurate: when few tasks exist at a level, crunch-job may use only a limited number of slots at that level. The math still counts the slots it leaves idle as running jobs, but they're not.

Update the function to calculate a new "running" number based on a more accurate measure, like maybe scalar(keys(%proc)).
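
To illustrate the difference, here is a rough Python sketch of the two ways of counting (crunch-job itself is Perl, so this is not its code). The function names, the 8-slot total, and the shape of the proc table are assumptions made for the example; only the idea of counting the entries in %proc comes from this ticket.

```python
def running_from_slots(total_slots, free_slots, held_slots):
    """Slot-based estimate: every slot that is not free or held counts as running."""
    return total_slots - free_slots - held_slots


def running_from_procs(proc):
    """Process-based estimate: count the child processes actually being tracked."""
    return len(proc)


# Hypothetical state: 8 slots in total, but only 1 is offered at this level
# because only 1 task exists. No task has started yet.
total_slots, free_slots, held_slots = 8, 1, 0
proc = {}  # stand-in for a table of running children, keyed by pid

before_old = running_from_slots(total_slots, free_slots, held_slots)
before_new = running_from_procs(proc)

# Start the single task: it takes the one free slot and gets a proc entry.
free_slots -= 1
proc[17941] = {"slot": 0}

after_old = running_from_slots(total_slots, free_slots, held_slots)
after_new = running_from_procs(proc)

print(before_old, before_new, after_old, after_new)
```

Under these assumptions the slot-based count reports 7 and then 8 running, while the process-based count reports 0 and then 1.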


Subtasks 2 (0 open, 2 closed)

Task #6745: Review 6598-crunch-progress-stats | Resolved | Tom Clegg | 07/13/2015
Task #6746: Fix summary calculation | Resolved | Tom Clegg | 07/13/2015
Actions #1

Updated by Brett Smith almost 9 years ago

  • Target version changed from 2015-08-19 sprint to 2015-08-05 sprint
Actions #2

Updated by Tom Clegg almost 9 years ago

  • Assigned To set to Tom Clegg
Actions #3

Updated by Tom Clegg over 8 years ago

  • Status changed from New to In Progress
Actions #4

Updated by Tom Clegg over 8 years ago

Tested a76d715 on 4xphq.

Before:

https://workbench.4xphq.arvadosapi.com/collections/58f9d7718475a24e87f73e413b1477d9+85/4xphq-8i9sb-04u5bq3yrmkvhrs.log.txt

2015-07-31_16:58:49 4xphq-8i9sb-04u5bq3yrmkvhrs 14689  start level 0 with 1 slots
2015-07-31_16:58:50 4xphq-8i9sb-04u5bq3yrmkvhrs 14689 status: 0 done, 7 running, 1 todo
2015-07-31_16:58:50 4xphq-8i9sb-04u5bq3yrmkvhrs 14689 0 job_task 4xphq-ot0gb-m6kfhcshuenxvir
2015-07-31_16:58:50 4xphq-8i9sb-04u5bq3yrmkvhrs 14689 0 child 17941 started on compute1.1
2015-07-31_16:58:50 4xphq-8i9sb-04u5bq3yrmkvhrs 14689 0 stderr starting: ['srun','--nodelist= .....
2015-07-31_16:58:51 4xphq-8i9sb-04u5bq3yrmkvhrs 14689 status: 0 done, 8 running, 0 todo
2015-07-31_16:58:51 4xphq-8i9sb-04u5bq3yrmkvhrs 14689 0 stderr Running [docker.io run .....

After:

https://workbench.4xphq.arvadosapi.com/collections/403a43f6261ca34a0a84d0dc6b153dea+85/4xphq-8i9sb-58u6wekhujgxur9.log.txt

2015-07-31_17:08:48 4xphq-8i9sb-58u6wekhujgxur9 7793  start level 0 with 1 slots
2015-07-31_17:08:49 4xphq-8i9sb-58u6wekhujgxur9 7793 status: 0 done, 0 running, 1 todo
2015-07-31_17:08:49 4xphq-8i9sb-58u6wekhujgxur9 7793 0 job_task 4xphq-ot0gb-z37h8tgwgggqo7p
2015-07-31_17:08:49 4xphq-8i9sb-58u6wekhujgxur9 7793 0 child 8189 started on compute1.1
2015-07-31_17:08:49 4xphq-8i9sb-58u6wekhujgxur9 7793 0 stderr starting: ['srun','--nodelist= ......
2015-07-31_17:08:49 4xphq-8i9sb-58u6wekhujgxur9 7793 status: 0 done, 1 running, 0 todo
2015-07-31_17:08:49 4xphq-8i9sb-58u6wekhujgxur9 7793 0 stderr Running [docker.io run .......
Actions #5

Updated by Brett Smith over 8 years ago

a76d715 is good to merge. Thank you.

Actions #6

Updated by Tom Clegg over 8 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 50 to 100

Applied in changeset arvados|commit:6988f4d44d2f8f7fc4aa2c381334c44d3133cf31.
