Story #2882

Record performance stats in job logs

Added by Tom Clegg over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Start date:
06/03/2014
Due date:
% Done:

100%

Estimated time:
(Total: 16.00 h)
Story points:
3.0

Subtasks

Task #2899: Print performance statistics to stderrResolvedPeter Amstutz

Task #2959: Review 2882-job-process-statsResolvedPeter Amstutz

Task #2898: Collect data from cgroupsResolvedPeter Amstutz

Associated revisions

Revision 270a94c3
Added by Peter Amstutz over 6 years ago

Merge branch 'master' into 2882-job-process-stats refs #2882

Revision f6f4efcf
Added by Peter Amstutz over 6 years ago

Merge branch 'master' into 2882-job-process-stats refs #2882

Revision f0bf04ab
Added by Peter Amstutz over 6 years ago

Merge branch '2882-job-process-stats' refs #2882

Revision 0b4f867e (diff)
Added by Peter Amstutz over 6 years ago

Fixes PipelineInstancesControllerTest. application_controller#create now does
the right thing based on requested response format. refs #2882

Revision 0e7e57bb (diff)
Added by Brett Smith over 6 years ago

2882: Workbench generic create controller adds JSON href attribute.

This is a bugfix commit. This detail got lost when create stopped
calling show to render the object. Refs #2882.

Revision 428973c0 (diff)
Added by Peter Amstutz over 6 years ago

Reverted change a0aba6d "crunch-dispatch now sends a clean environment to crunch-job" refs #2882.

Revision bba95a22
Added by Peter Amstutz over 6 years ago

Merge branch 'origin-2882-job-process-stats' refs #2882

Revision ade6281b (diff)
Added by Peter Amstutz over 6 years ago

Crunchstat logs where it reads its stats from. refs #2882

Revision abbb3136 (diff)
Added by Peter Amstutz over 6 years ago

Bugfix searching for cgroup stats files. refs #2882

Revision 6089a56c (diff)
Added by Peter Amstutz over 6 years ago

Bugfix in logging which cgroup stats files it uses. refs #2882

Revision 557b0b0e (diff)
Added by Peter Amstutz over 6 years ago

Expanded search path for cgroup stats, changed command line interface a bit.
Also adjusted polling interval to every 10 seconds istead of every 1 second.
refs #2882

History

#1 Updated by Ward Vandewege over 6 years ago

  • Subject changed from Show process performance stats (start with CPU load) in Workbench while tasks are running to Show process performance stats (start with CPU load) in Workbench while tasks are running (get data from cgroups)

#2 Updated by Ward Vandewege over 6 years ago

  • Project changed from Umbrella Project to Arvados

#3 Updated by Peter Amstutz over 6 years ago

  • Assigned To set to Peter Amstutz

#5 Updated by Brett Smith over 6 years ago

Reviewing 618a868

  • On the Job status tab, why does no_reuse have to be set using a different method from other attributes?
  • When crunch-dispatch invokes crunch-job with sudo, $PATH needs to be passed along to help crunch-job find Gem binaries like arv. I suspect it also needs $PERLLIB to find the Arvados module for itself.
  • If we're going to manually set the environment this way, removing the -E option from sudo ("preserve environment") would help clarify the intent.
  • In arv-mount's exec handling, rc = 255 needs to be done at the start of the try block. It provides a default value for the exception handlers, so it needs to be set before any exceptions can be raised.
  • arv-mount's current signal handlers can potentially raise errors if a handled signal arrives between the time the child process finishes, and the time arv-mount finishes. Either the original handlers should be restored after the subprocess finishes, or the signal handlers should chain (i.e., check if the child is running; if not, do the default thing).
  • Similarly, should crunchstat's signal handlers proceed with normal handling after it's forwarded to the child?
  • In crunchstat, "elapsed" is the correct spelling.

#6 Updated by Peter Amstutz over 6 years ago

  • no_reuse is a parameter to the create method, not part of the job object
  • See line crunch-dispatch.rb:155, it passes the environment based on the present of the new --use-env option
  • We need the -E because that means "pass sudo's environment to the application", and it's sudo that is the subprocess that is immediately run by Popen
  • Moved up rc = 255
  • Fixed arv-mount to restore the signal handlers after sp.wait().
  • I'm not really sure what the right answer is, here. The signal handling was added based on the observation that crunch-job, when a job is canceled, will first send SIGINT, then SIGTERM, then SIGKILL. SIGKILL can't be masked, but SIGINT and SIGTERM can both be passed along to the underlying process. However, it's not totally clear that the right behavior for arv-mount and crunchstat, on getting SIGTERM, is to SIGTERM, wait, and then SIGKILL the underlying process.
  • Fixed spelling.

I realized that this branch also has the re-run button, which I had posted for review separately, and has comments from Radhika. I should either merge that first, or roll the outstanding re-run branch into this one.

#7 Updated by Peter Amstutz over 6 years ago

  • Subject changed from Show process performance stats (start with CPU load) in Workbench while tasks are running (get data from cgroups) to Record performance stats in job logs

#8 Updated by Brett Smith over 6 years ago

Peter Amstutz wrote:

  • no_reuse is a parameter to the create method, not part of the job object

I think you're thinking of find_or_create? Looking at the API server Jobs controller and associated tests, it never looks for :no_reuse in params, just resource_attrs.

  • See line crunch-dispatch.rb:155, it passes the environment based on the present of the new --use-env option

Does that mean the idea is that we would always use this flag in production? If so, what's the value of making it optional?

  • I'm not really sure what the right answer is, here. The signal handling was added based on the observation that crunch-job, when a job is canceled, will first send SIGINT, then SIGTERM, then SIGKILL. SIGKILL can't be masked, but SIGINT and SIGTERM can both be passed along to the underlying process. However, it's not totally clear that the right behavior for arv-mount and crunchstat, on getting SIGTERM, is to SIGTERM, wait, and then SIGKILL the underlying process.

I agree; I don't see a need for something so involved either. I'm only asking, after they forward the received signal to the child, should they abort as they normally would (i.e., exit(-code) at the end of the handler)?

#9 Updated by Peter Amstutz over 6 years ago

  • On closer look, you're right, no_reuse is not necessary. Originally, reusing a job was the default 'create' behavior and you had to specify 'no_reuse', at some point it was flipped around so that you have to specify find_or_create to get job reuse. So the re-run job buttons don't need to do anything special to get the right behavior.
  • I was assuming that in production, everything is installed systemwide so that everything works with the default environment. Ward says the only blocker is packaging the perl sdk to be systemwide, and we can use --use-env in the meantime.
  • The rationale is that if the underlying process doesn't quit on the first try, but arv-mount and crunchstat do quit, they won't be around to forward any subsequent signals to the underlying process. On background: this feature came up because the crazy nesting of the job inside docker inside crunchstat inside arv-mount broke the ability to cancel jobs, since it would only kill the outermost job. As long as that works most of the time, we should be good.

#10 Updated by Brett Smith over 6 years ago

I think this is good to merge, then. Thanks.

#11 Updated by Peter Amstutz over 6 years ago

  • Status changed from New to Resolved

Also available in: Atom PDF