Project

General

Profile

Actions

Idea #2882

closed

Record performance stats in job logs

Added by Tom Clegg almost 10 years ago. Updated almost 10 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Story points:
3.0

Subtasks 3 (0 open3 closed)

Task #2899: Print performance statistics to stderrResolvedPeter Amstutz06/03/2014Actions
Task #2959: Review 2882-job-process-statsResolvedPeter Amstutz06/03/2014Actions
Task #2898: Collect data from cgroupsResolvedPeter Amstutz06/03/2014Actions
Actions #1

Updated by Ward Vandewege almost 10 years ago

  • Subject changed from Show process performance stats (start with CPU load) in Workbench while tasks are running to Show process performance stats (start with CPU load) in Workbench while tasks are running (get data from cgroups)
Actions #2

Updated by Ward Vandewege almost 10 years ago

  • Project changed from 37 to Arvados
Actions #3

Updated by Peter Amstutz almost 10 years ago

  • Assigned To set to Peter Amstutz
Actions #5

Updated by Brett Smith almost 10 years ago

Reviewing 618a868

  • On the Job status tab, why does no_reuse have to be set using a different method from other attributes?
  • When crunch-dispatch invokes crunch-job with sudo, $PATH needs to be passed along to help crunch-job find Gem binaries like arv. I suspect it also needs $PERLLIB to find the Arvados module for itself.
  • If we're going to manually set the environment this way, removing the -E option from sudo ("preserve environment") would help clarify the intent.
  • In arv-mount's exec handling, rc = 255 needs to be done at the start of the try block. It provides a default value for the exception handlers, so it needs to be set before any exceptions can be raised.
  • arv-mount's current signal handlers can potentially raise errors if a handled signal arrives between the time the child process finishes, and the time arv-mount finishes. Either the original handlers should be restored after the subprocess finishes, or the signal handlers should chain (i.e., check if the child is running; if not, do the default thing).
  • Similarly, should crunchstat's signal handlers proceed with normal handling after it's forwarded to the child?
  • In crunchstat, "elapsed" is the correct spelling.
Actions #6

Updated by Peter Amstutz almost 10 years ago

  • no_reuse is a parameter to the create method, not part of the job object
  • See line crunch-dispatch.rb:155, it passes the environment based on the present of the new --use-env option
  • We need the -E because that means "pass sudo's environment to the application", and it's sudo that is the subprocess that is immediately run by Popen
  • Moved up rc = 255
  • Fixed arv-mount to restore the signal handlers after sp.wait().
  • I'm not really sure what the right answer is, here. The signal handling was added based on the observation that crunch-job, when a job is canceled, will first send SIGINT, then SIGTERM, then SIGKILL. SIGKILL can't be masked, but SIGINT and SIGTERM can both be passed along to the underlying process. However, it's not totally clear that the right behavior for arv-mount and crunchstat, on getting SIGTERM, is to SIGTERM, wait, and then SIGKILL the underlying process.
  • Fixed spelling.

I realized that this branch also has the re-run button, which I had posted for review separately, and has comments from Radhika. I should either merge that first, or roll the outstanding re-run branch into this one.

Actions #7

Updated by Peter Amstutz almost 10 years ago

  • Subject changed from Show process performance stats (start with CPU load) in Workbench while tasks are running (get data from cgroups) to Record performance stats in job logs
Actions #8

Updated by Brett Smith almost 10 years ago

Peter Amstutz wrote:

  • no_reuse is a parameter to the create method, not part of the job object

I think you're thinking of find_or_create? Looking at the API server Jobs controller and associated tests, it never looks for :no_reuse in params, just resource_attrs.

  • See line crunch-dispatch.rb:155, it passes the environment based on the present of the new --use-env option

Does that mean the idea is that we would always use this flag in production? If so, what's the value of making it optional?

  • I'm not really sure what the right answer is, here. The signal handling was added based on the observation that crunch-job, when a job is canceled, will first send SIGINT, then SIGTERM, then SIGKILL. SIGKILL can't be masked, but SIGINT and SIGTERM can both be passed along to the underlying process. However, it's not totally clear that the right behavior for arv-mount and crunchstat, on getting SIGTERM, is to SIGTERM, wait, and then SIGKILL the underlying process.

I agree; I don't see a need for something so involved either. I'm only asking, after they forward the received signal to the child, should they abort as they normally would (i.e., exit(-code) at the end of the handler)?

Actions #9

Updated by Peter Amstutz almost 10 years ago

  • On closer look, you're right, no_reuse is not necessary. Originally, reusing a job was the default 'create' behavior and you had to specify 'no_reuse', at some point it was flipped around so that you have to specify find_or_create to get job reuse. So the re-run job buttons don't need to do anything special to get the right behavior.
  • I was assuming that in production, everything is installed systemwide so that everything works with the default environment. Ward says the only blocker is packaging the perl sdk to be systemwide, and we can use --use-env in the meantime.
  • The rationale is that if the underlying process doesn't quit on the first try, but arv-mount and crunchstat do quit, they won't be around to forward any subsequent signals to the underlying process. On background: this feature came up because the crazy nesting of the job inside docker inside crunchstat inside arv-mount broke the ability to cancel jobs, since it would only kill the outermost job. As long as that works most of the time, we should be good.
Actions #10

Updated by Brett Smith almost 10 years ago

I think this is good to merge, then. Thanks.

Actions #11

Updated by Peter Amstutz almost 10 years ago

  • Status changed from New to Resolved
Actions

Also available in: Atom PDF