Bug #2955

Orphan jobs are marked as running forever. Crunch-dispatcher knows which jobs it is actually running and should mark any others as failed.

Added by Peter Amstutz over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Start date:
06/04/2014
Due date:
% Done:

100%

Estimated time:
(Total: 1.00 h)
Story points:
1.0

Subtasks

Task #2958: Review 2955-fail-orphan-jobsResolvedPeter Amstutz

Task #2956: Fail spurious jobsResolvedPeter Amstutz

Associated revisions

Revision 6a8463ee
Added by Peter Amstutz over 6 years ago

Merge branch '2955-fail-orphan-jobs' closes #2955

History

#1 Updated by Peter Amstutz over 6 years ago

  • Assigned To set to Peter Amstutz

#2 Updated by Tom Clegg over 6 years ago

As of a276e406 (unsolicited review, but hopefully helpful anyway):

With this implementation, if two crunch-dispatch processes are running at the same time (deliberately or otherwise), they will keep marking each other's jobs as failed, even though the crunch-job processes are still running and still being monitored by crunch-dispatch. So this seems to trade a "some jobs refuse to finish" problem for a "no jobs longer than 30 seconds can run" problem.

This can be fixed by implementing the cleanup task as a separate script (say, script/cancel_all_running_jobs.rb). The sysadmin can choose to invoke it just before crunch-dispatch in a runit/daemontools script.

(I'm not sure what the motivation is for checking this periodically while crunch-dispatch is still alive. Is our wait() handling broken? Surely that should be fixed rather than swept under the rug?)

Also: set jobrecord.cancelled_at = Time.now -- that way crunch-job itself (if it is in fact still running somewhere) will clean up and exit, instead of continuing to use up compute resources.

#3 Updated by Peter Amstutz over 6 years ago

  1. Added a check to see if the job has reported any logs in the past 5 minutes. Assuming there is something like crunchstat producing a heartbeat, this should avoid the "multiple crunch-dispatchers canceling each other's jobs" problem
  2. I like the idea of the table being actively managed, instead of just cleaned up at the whim of the admin. For example, if you have multiple crunch-dispatchers and one of them goes down, this allows the second dispatcher to clean up after the crashed one.
  3. Now sets canceled_at.

This would also be a bit less convoluted if we did a similar column cleanup of jobs that we did with pipeline instances.

#4 Updated by Tom Clegg over 6 years ago

At 1ec252c

Peter Amstutz wrote:

  1. Added a check to see if the job has reported any logs in the past 5 minutes. Assuming there is something like crunchstat producing a heartbeat, this should avoid the "multiple crunch-dispatchers canceling each other's jobs" problem

This is better but still not as reliable as "just don't kill jobs that you know you didn't dispatch". AFAICT if it wasn't running when this crunch-dispatch started, and this crunch-dispatch didn't start it, this crunch-dispatch should not touch it. The simple way to achieve this (and one which makes it easy for an admin to run it manually, decide not to use it, etc.) is to move this to a separate script.

Another reason this feature is not suitable for production is that it will kill all users' "dev" jobs (running on VMs) every 30 seconds. So at minimum we need a way to disable it.

  1. I like the idea of the table being actively managed, instead of just cleaned up at the whim of the admin. For example, if you have multiple crunch-dispatchers and one of them goes down, this allows the second dispatcher to clean up after the crashed one.

I don't think multiple dispatchers should be responsible for cleaning up after one another's crashes.

Currently, the only situation where the table is not already actively managed is when crunch-dispatch exits unexpectedly. "Cleaning up" in other situations is always wrong. (What am I missing?)

  1. Now sets canceled_at.

Need to change to cancelled_at

Suggestion for doing this so it can be used in production:

  1. Add dispatched_by_uuid attribute to job
  2. Fill in take and untake methods to set/clear dispatched_by_uuid
  3. Use a different privileged pseudo-user for each crunch-dispatch process (but using system_user like we do now is fine if you only have one crunch-dispatch process)
  4. At startup, kill orphaned jobs that should be monitored by this crunch-dispatch, i.e., unfinished jobs whose dispatched_by_uuid attribute equals current_user.uuid

Suggestion for doing this quickly in a way that solves your immediate problem of having orphaned jobs in a dev environment:

  1. Make a separate script. Run it just before starting crunch-dispatch.

#5 Updated by Tom Clegg over 6 years ago

At a46f015

  • Change :canceled_at to :cancelled_at in arv-run-pipeline-instance
  • The name "refresh_running" could change to something less enigmatic like "cancel_all_running_jobs".

Other than that the cleanup script looks good.

I'd like to know more about the environment variables change, though. It looks like
  • If no flag is specified, only ARVADOS_* and CRUNCH_* get propagated. PATH and others do not. (Doesn't this effectively cripple crunch-job?)
  • If the flag is specified, only ARVADOS_* and CRUNCH_* and the magic list of other stuff gets propagated.
  • There is no way to achieve the previous behavior (propagate everything).

It looks like this is meant to fix some buggy/undesired behavior but I don't know what it is. (Where should I be looking for that?)

Whatever the behaviors are...

It's a bit confusing that we clear the environment, ask "sudo -E" to preserve the entire empty environment, then add env vars to the command line after "sudo -E". Wouldn't it be clearer to build a hash with the appropriate environment, and pass that to popen3 using sudo -E and unsetenv_others? (Or does sudo -E not work as advertised?)

HOME=/dev/null is new to me -- /dev/null is not a directory. Is this better than deleting HOME from env entirely, or setting it to /nonexistent? (I've seen both of those conventions before.)

#6 Updated by Peter Amstutz over 6 years ago

  • Function and script renamed to "cancel_stale_jobs"
  • Fixed canceled_at -> cancelled_at (this turns out to be a legitimate regional difference in spelling: http://grammarist.com/spelling/cancel/)

If you don't clean up the environment, whatever garbage you have in your environment like $DISPLAY, $SSH_AGENT_PID, etc gets propagated to crunch-job. This is undesirable from a provenance and reproducibility standpoint since it adds another thing that could cause a job run change its behavior, and making sure the environment is clean keeps us honest. The "magic list of other stuff" is the minimal set required to run crunch-job if you set up all those environment variables to point to dev paths. In production, all these things should (ideally) be installed to system paths, so it shouldn't need those variables at all. I spoke to Ward, he said the only holdup is packaging the Perl SDK, which he intends to do soon.

It's possible this will break in production in other interesting ways, but that's what 4xphq is for!

The "secure_path" feature of "sudo -E" monkeys around with PATH and (seemingly) other variables for security reasons, so setting it explicitly on the command line was the only way I could get it to work reliably. We might be able to take out the "-E" since we're setting up the environment on the command line, but the current behavior makes it totally explicit to sudo what we're trying to do.

I believe crunch-job breaks if HOME isn't set, but I can't remember what. Setting HOME=/tmp would be fine too.

#7 Updated by Tom Clegg over 6 years ago

Peter Amstutz wrote:

  • Function and script renamed to "cancel_stale_jobs"
  • Fixed canceled_at -> cancelled_at (this turns out to be a legitimate regional difference in spelling: http://grammarist.com/spelling/cancel/)

Great

If you don't clean up the environment, whatever garbage you have in your environment like $DISPLAY, $SSH_AGENT_PID, etc gets propagated to crunch-job. This is undesirable from a provenance and reproducibility standpoint since it adds another thing that could cause a job run change its behavior, and making sure the environment is clean keeps us honest. The "magic list of other stuff" is the minimal set required to run crunch-job if you set up all those environment variables to point to dev paths. In production, all these things should (ideally) be installed to system paths, so it shouldn't need those variables at all. I spoke to Ward, he said the only holdup is packaging the Perl SDK, which he intends to do soon.

Can you clarify what the "no PATH variable" option is for? As far as I can tell the two behaviors one might desire are "leave env alone" and "restrict env to known-necessary ones". I'm missing why we seem to offer the useless "no PATH variable" behavior, and why we make the "leave env alone" option impossible. Am I reading the code wrong?

It's possible this will break in production in other interesting ways, but that's what 4xphq is for!

...when code review fails? :P

The "secure_path" feature of "sudo -E" monkeys around with PATH and (seemingly) other variables for security reasons, so setting it explicitly on the command line was the only way I could get it to work reliably. We might be able to take out the "-E" since we're setting up the environment on the command line, but the current behavior makes it totally explicit to sudo what we're trying to do.

I guess that shouldn't surprise me.

I believe crunch-job breaks if HOME isn't set, but I can't remember what. Setting HOME=/tmp would be fine too.

Also shouldn't surprise me I guess

Thanks

#8 Updated by Peter Amstutz over 6 years ago

Tom Clegg wrote:

Can you clarify what the "no PATH variable" option is for? As far as I can tell the two behaviors one might desire are "leave env alone" and "restrict env to known-necessary ones". I'm missing why we seem to offer the useless "no PATH variable" behavior, and why we make the "leave env alone" option impossible. Am I reading the code wrong?

sudo sets its own minimal environment regardless of whether you use "-E" or not, so there's always a default PATH.

My operating assumption was that the whole reason for "sudo" being there in the first place was to allow crunch-dispatcher and crunch-job to run as different users, in which case it seems obviously preferable to not leak information (in the form of crunch-dispatcher's environment) by default. [As of this writing] Ward is questioning that assumption since in production both crunch-dispatcher and crunch-job currently run as the "crunch" user; in that case we can take out both the "sudo" and all the screwing around with the environment required to work around sudo's behavior, and then everyone would be happy.

#9 Updated by Tom Clegg over 6 years ago

Peter Amstutz wrote:

My operating assumption was that the whole reason for "sudo" being there in the first place was to allow crunch-dispatcher and crunch-job to run as different users, in which case it seems obviously preferable to not leak information (in the form of crunch-dispatcher's environment) by default. [As of this writing] Ward is questioning that assumption since in production both crunch-dispatcher and crunch-job currently run as the "crunch" user; in that case we can take out both the "sudo" and all the screwing around with the environment required to work around sudo's behavior, and then everyone would be happy.

Yes, sudo is is there to let you have separate resource limits and process spaces for crunch-dispatch (which is supposed to be long-running) and crunch-job (which will surely use a ton of RAM on a badly behaved job), taking advantage of things like sudoers and login profiles instead of having to provide our own custom mechanisms. It also makes it possible to isolate (say) apiserver config files from crunch-job processes. Maybe a linux container would be a better way to do this, or just hope for the best until we start running crunch-job on a worker node instead of a controller node.

(Anyway, seems the env stuff is already merged, this doesn't block 2955, this conversation is misfiled, and if I want something different I can start a new issue...)

#10 Updated by Peter Amstutz over 6 years ago

We ran into trouble getting this to work on 4xphq, so the environment stuff is now reverted. Since you already approved the other changes, I will merge the cancel_stale_jobs script.

#11 Updated by Anonymous over 6 years ago

  • Status changed from New to Resolved
  • % Done changed from 50 to 100

Applied in changeset arvados|commit:6a8463eea9e32cb5cffcaf2f04667520794404ec.

#12 Updated by Ward Vandewege over 6 years ago

  • Story points set to 1.0

Also available in: Atom PDF