Bug #3168

[Crunch] Source git repository for a job should not be fetched into internal arvados repository over and over again even when requested node profile is not available

Added by Peter Amstutz about 5 years ago. Updated almost 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Start date:
09/25/2014
Due date:
% Done:

100%

Estimated time:
(Total: 1.00 h)
Story points:
0.5

Description

Crunch fetches the source git repository into the Arvados internal repository even if the desired commit is already available there. It should first check whether the commit is present in the internal repository, and run "git fetch-pack" only when it is not. This will also suppress the "fetch over and over" behavior.
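A minimal sketch of the intended check, assuming plain `git` subprocess calls (the method and variable names here are hypothetical, not crunch-dispatch's actual code):

```ruby
# Returns true if sha1 already exists as a commit in the internal
# repository at repo_dir, so the fetch can be skipped entirely.
def commit_in_internal_repo?(repo_dir, sha1)
  system('git', '--git-dir', repo_dir,
         'cat-file', '-e', "#{sha1}^{commit}",
         out: File::NULL, err: File::NULL)
end

# Fetch from the source repository only when the commit is missing.
def ensure_commit_fetched(repo_dir, src_repo_url, sha1)
  return true if commit_in_internal_repo?(repo_dir, sha1)
  system('git', '--git-dir', repo_dir, 'fetch-pack', src_repo_url, sha1)
end
```

`git cat-file -e <sha1>^{commit}` exits nonzero when the object is absent, which is what makes the cheap pre-check possible.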


Subtasks

Task #3986: Review 3168-crunch-git-fetch (Resolved, Peter Amstutz)

Associated revisions

Revision c2c38069
Added by Peter Amstutz almost 5 years ago

Merge branch '3168-crunch-git-fetch' closes #3168

History

#1 Updated by Tom Clegg about 5 years ago

  • Target version set to Arvados Future Sprints

#2 Updated by Peter Amstutz about 5 years ago

  • Subject changed from Git repository for jobs is deployed over and over again even when requested node is busy to Source git repository for a job is fetched into internal arvados repository over and over again even when requested node is busy
  • Description updated (diff)

#3 Updated by Tom Clegg about 5 years ago

  • Description updated (diff)

#4 Updated by Brett Smith about 5 years ago

  • Description updated (diff)

(Split the second half of this bug into #3278, since they're unrelated.)

#5 Updated by Tim Pierce almost 5 years ago

  • Story points set to 0.5

#6 Updated by Tom Clegg almost 5 years ago

  • Subject changed from Source git repository for a job is fetched into internal arvados repository over and over again even when requested node is busy to [Crunch] Source git repository for a job should not be fetched into internal arvados repository over and over again even when requested node profile is not available

#7 Updated by Tom Clegg almost 5 years ago

  • Target version changed from Arvados Future Sprints to 2014-10-08 sprint

#8 Updated by Peter Amstutz almost 5 years ago

  • Status changed from New to In Progress
  • Assigned To set to Peter Amstutz

#9 Updated by Tom Clegg almost 5 years ago

At 9a15397

  • It would be nice to avoid repeating that "print cmd; print cmd output; if error then sleep 1 and take next" template 3 times. But I guess I can live with it for now.
  • Rather than running git N times for N scheduling attempts, could we just run it once, and remember "script_version X has been fetched" in a hash table? (I thought we had already done something similar for API tokens, but I don't see it -- is that still broken too?)
  • At unless tag_rev == job.script_version -- If a script_version changes after an unsuccessful bid to start the job, does that leave the job stuck in the queue because the git-tag will always fail-and-retry from now on? Would it be better to fail the job here?
    • Another alternative would be to force-update the tag, but I think that creates a race bug where the winner of the "start the job" race uses version A but the git repo ends up tagged at version B by the loser.
    • Perhaps a check for "is the revision tagged correctly" should be done in crunch-job, after the job has been locked, to reliably detect races.
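The hash-table idea in the second bullet could be sketched like this (hypothetical names; the fetch itself is passed in as a block so the caching logic stays independent of git):

```ruby
# Remembers which script_versions have already been fetched into the
# internal repository, so repeated scheduling attempts for the same
# commit skip git entirely.
class FetchCache
  def initialize
    @fetched = {}
  end

  # Runs the fetcher block only the first time a sha1 is seen, and
  # records only successes, so a failed fetch will be retried.
  def fetch_once(sha1)
    return true if @fetched[sha1]
    @fetched[sha1] = true if yield(sha1)
    @fetched.key?(sha1)
  end
end
```

This is safe precisely for the reason discussed below: commits, once added to the internal repository, do not disappear.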

#10 Updated by Tom Clegg almost 5 years ago

  • Category set to Crunch

#11 Updated by Peter Amstutz almost 5 years ago

Tom Clegg wrote:

At 9a15397

  • It would be nice to avoid repeating that "print cmd; print cmd output; if error then sleep 1 and take next" template 3 times. But I guess I can live with it for now.

Yeah, I'm not thrilled by the repetition either, but it would require more aggressive refactoring than I was willing to do in the context of this (fairly narrow) bug fix.

  • Rather than running git N times for N scheduling attempts, could we just run it once, and remember "script_version X has been fetched" in a hash table? (I thought we had already done something similar for API tokens, but I don't see it -- is that still broken too?)

Even with a cache, if you have multiple crunch jobs you'll still be calling git multiple times. While forking to call git is somewhat heavyweight compared to a hash table lookup, keeping a cache means more code and introduces the "multiple sources of truth" problem.

  • At unless tag_rev == job.script_version -- If a script_version changes after an unsuccessful bid to start the job, does that leave the job stuck in the queue because the git-tag will always fail-and-retry from now on? Would it be better to fail the job here?

You're right, it would be better in the case of a tag collision to fail instead of lingering. Fixed in 1d8f975.

A tag collision should never happen unless someone is messing around with the internal git repository and/or the job record. I wrote up #3988 about protecting the job record better. I also added #4003 since crunch dispatch errors (such as git errors) don't go to the logs table, so they are not visible to the user (the job would just fail silently).
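The collision check reduces to a small decision that can be tested on its own (a sketch; `tag_rev` would come from `git rev-parse` on the job's tag, and the actual fix is in 1d8f975):

```ruby
# tag_rev: the commit the existing per-job tag points at (nil if the
# tag does not exist yet); script_version: the commit the job wants.
# Returns :ok when dispatch may proceed, :fail when the job should be
# failed outright instead of retried forever.
def tag_disposition(tag_rev, script_version)
  (tag_rev.nil? || tag_rev == script_version) ? :ok : :fail
end
```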

#12 Updated by Tom Clegg almost 5 years ago

Peter Amstutz wrote:

Yeah, I'm not thrilled by the repetition either, but it would require more aggressive refactoring than I was willing to do in the context of this (fairly narrow) bug fix.

Fair enough

Even with a cache, if you have multiple crunch jobs you'll still be calling git multiple times. While forking to call git is somewhat heavyweight compared to a hash table lookup, keeping a cache means more code and introduces the "multiple sources of truth" problem.

Given that we're talking about "has a sha1 been added to this repo?" and the whole point of that repo is that commits, once added, do not disappear, I figured a hash would be extremely safe. (And by "once" I meant one time per commit sha1, not one time ever!)

But, OK. Don't need to hold up the bugfix for it.

  • At unless tag_rev == job.script_version -- If a script_version changes after an unsuccessful bid to start the job, does that leave the job stuck in the queue because the git-tag will always fail-and-retry from now on? Would it be better to fail the job here?

You're right, it would be better in the case of a tag collision to fail instead of lingering. Fixed in 1d8f975.

A tag collision should never happen unless someone is messing around with the internal git repository and/or the job record. I wrote up #3988 about protecting the job record better. I also added #4003 since crunch dispatch errors (such as git errors) don't go to the logs table, so they are not visible to the user (the job would just fail silently).

LGTM, except: I think save! (2×) is not correct (even though we do use it elsewhere). Every time we use save! we risk exiting crunch-dispatch and losing supervision/logging of whatever other jobs happen to be running right now. Is this error really worth crashing on? Where possible, we should stick with save or recover -- even if recover is just "log something and continue" -- instead of save or crash.
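The "save or recover" pattern might look like this sketch (assuming an ActiveRecord-style job object; `state`, `uuid`, and `fail_job` are illustrative names, not the actual crunch-dispatch code):

```ruby
# Mark a job failed without risking the dispatcher itself: save
# returns false on failure where save! would raise and take down
# supervision of every other running job.
def fail_job(job, reason)
  job.state = 'Failed'
  if job.save
    $stderr.puts "Job #{job.uuid} failed: #{reason}"
  else
    $stderr.puts "Could not persist failure for job #{job.uuid}; " \
                 "continuing to supervise other jobs"
  end
end
```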

#13 Updated by Tom Clegg almost 5 years ago

LGTM @ d592a37

#14 Updated by Anonymous almost 5 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:c2c38069d28fc68dea6e1b2cb0d5f4f36e1ef03f.
