Feature #3775

[Crunch] arv-crunch-job should work in local mode with an arvados-hosted repository

Added by Tom Clegg almost 5 years ago. Updated almost 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Start date:
08/29/2014
Due date:
% Done:

100%

Estimated time:
(Total: 5.00 h)
Story points:
1.0

Description

Currently, in local mode, repository="arvados" only works if ./arvados/.git happens to exist when crunch-job runs.

New behavior:
  • You can use an arbitrary git URL in your job's "repository" attribute.
  • You can use the name of an Arvados-hosted repository. The named repo's fetch-url will be used. (This is likely to require SSH agent forwarding since the fetch-url is typically given as an SSH path.)
  • You can specify a local filesystem path starting with / or ./ or ../.
  • Compatibility issue: When running jobs in your terminal, "repository":"foo" in a job record used to mean "use ./foo or ./foo/.git". Now it means "use hosted repository called foo" just like it would in a job submitted to the API server. To get the old behavior, you can now use "repository":"./foo".
Unchanged (but relevant) behavior:
  • You can override the job record's "repository" attribute by passing a different URL/name to crunch-job as a --git-dir argument.
Caveats:
  • Remote git URLs still aren't supported when submitting jobs to the API server to run on worker nodes: there, you can (still) only use the name of a hosted repository. New features only work when using arv pipeline run --run-here or arv-crunch-job.

Subtasks

Task #4059: Accept arbitrary git url as repository if --git-dir not givenResolvedTom Clegg

Task #4060: Look up fetch-url of repo via arvados API if bare repo name given in repository attrResolvedTom Clegg

Task #4082: Review 3775-fetch-git-repoResolvedTom Clegg


Related issues

Related to Arvados - Bug #4048: [Crunch] arv-run-pipeline-instance --run-jobs-here cannot run jobs with local script path in script_version.Closed

Associated revisions

Revision 38cc5c0a
Added by Tom Clegg almost 5 years ago

Merge branch '3775-fetch-git-repo' closes #3775

Revision dd7bb176
Added by Tom Clegg almost 5 years ago

Merge branch '3775-fetch-git-repo' closes #3775

Conflicts:
sdk/cli/bin/crunch-job

History

#1 Updated by Tom Clegg almost 5 years ago

  • Subject changed from Make arv-crunch-job work in local mode with an arvados-hosted repository to arv-crunch-job should work in local mode with an arvados-hosted repository
  • Description updated (diff)
  • Category set to Crunch
  • Parent task deleted (#3550)

#2 Updated by Tom Clegg almost 5 years ago

  • Tracker changed from Task to Feature

#3 Updated by Tom Clegg almost 5 years ago

  • Target version changed from 2014-09-17 sprint to Arvados Future Sprints

#4 Updated by Tom Clegg almost 5 years ago

  • Subject changed from arv-crunch-job should work in local mode with an arvados-hosted repository to [Crunch] arv-crunch-job should work in local mode with an arvados-hosted repository

#5 Updated by Tom Clegg almost 5 years ago

  • Story points set to 1.0

#6 Updated by Tom Clegg almost 5 years ago

  • Target version changed from Arvados Future Sprints to 2014-10-08 sprint

#7 Updated by Tom Clegg almost 5 years ago

From c1a82be commit message:

3775: Run local/dev jobs on code from arbitrary remote git repositories.

  • repository is a local path X -> use local repository X or X/.git.
  • repository is a git url (git://, https://, git@host:repo.git, etc)
    fetch branches, tags, and objects from the remote and use the result
    to resolve script_version to a commit sha1 and run the job.
  • else -> look up the named repository in Arvados and use its
    fetch_url as a remote git url as above.
  • --git-dir is given on the command line -> use that instead of
    $Job->{repository} in the above logic. (This isn't new. It ensures
    we don't go off fetching arbitrary remotes when crunch-dispatch.rb
    has already pulled the code into its own internal git repo.)

Incidental changes:

  • Lose support for looking up subversion revision numbers using
    git-svn tags.
  • Lose support for CRUNCH_DEFAULT_GIT_DIR environment variable. Pass
    --git-dir instead.
  • Improve log messages during checkout/install phase.

#8 Updated by Tom Clegg almost 5 years ago

  • Status changed from New to In Progress

#9 Updated by Tim Pierce almost 5 years ago

Review at 5c1f03ef:

Time is short, so there's only one functional change I would like to see right away:

Given that, the code LGTM and I'm okay to merge.

Here's some other stuff that's mostly about clarifying the logic for crunch-job newbies, or longer term improvements (and even if we are ready to dive into rewriting crunch-job, documenting outstanding issues will be important).

  • Suggestion: have cleanup() first perform return unless $Job. Then there's no need to worry about calling it prematurely elsewhere.
  • As a matter of documentation can we update the comment at L385, to describe more precisely what's happening, maybe this?
      # If script_version describes an absolute directory path, and we
      # were not invoked with a --git-dir argument (i.e. not running under
      # crunch-dispatch) then assume that the script_version signifies a
      # file that is available locally, and we should not attempt to resolve
      # it as a git commit.
    
  • Add comment after L392 to state explicitly that we need to resolve script_version as a git commit, possibly from a remote repo.
  • L396: I think it's okay to assume that $Job->{repository} exists because the Job.create method in the controller demands it, but it's a little confusing that the model doesn't enforce a NOT NULL constraint. I think it's worth adding a story to extend the model to enforce this and other constraints.
  • At L400: I think the use cases here are:
    • crunch-job was called with --git-dir, which means that we're working from a git repository and script_version may need to be resolved as a commit.
    • script_version is not an absolute path and so needs to be resolved as a git commit; if --git-dir was not given then this requires cloning a remote repository.
    • are there others?
  • L416: this code would allow Job.repository to be a URL of a git repository, not just the name of a Repository resource.
    • If that's only allowed for --git-dir then let's check that argument instead;
    • If it's officially supported to supply a repository URL for Job.repository then let's document it;
    • If it's not officially supported but we're being sneaky here, then let's comment to that effect.
  • L433: use items_available for a more accurate count (even if we don't bother with paging for $n > 0)
  • L474: I'm concerned that there are still edge cases where we could get a bad cached commit for $treeish where we should have fetched new commits (e.g. $treeish is a 'deadbeef' branch and there are commits on that branch whose hash begins with 'deadbeef', or $treeish is a branch in the remote repo that has not been fetched to the local one) Let's add a TODO to make a more comprehensive check that $treeish is not a branch or tag and really only resolves to a single commit.

#10 Updated by Anonymous almost 5 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 60 to 100

Applied in changeset arvados|commit:38cc5c0a51657c6b60f3d3f32c566845988dfb6b.

#11 Updated by Tom Clegg almost 5 years ago

Tim Pierce wrote:

Done @ dd2a958 (merged).

  • Suggestion: have cleanup() first perform return unless $Job. Then there's no need to worry about calling it prematurely elsewhere.

Done (also done to log_writer_is_active()-before-save_meta(), since it was right next door).

  • As a matter of documentation can we update the comment at L385, to describe more precisely what's happening, maybe this?
    [...]

Updated.

  • Add comment after L392 to state explicitly that we need to resolve script_version as a git commit, possibly from a remote repo.

Added.

  • L396: I think it's okay to assume that $Job->{repository} exists because the Job.create method in the controller demands it, but it's a little confusing that the model doesn't enforce a NOT NULL constraint. I think it's worth adding a story to extend the model to enforce this and other constraints.

We don't want to over-validate the job records for jobs that are run locally (if that's what crunch-job is in fact working with, we might as well keep a record of it) but yes, in order for a job to be Queued it should have a repository attribute.

  • At L400: I think the use cases here are:
    • crunch-job was called with --git-dir, which means that we're working from a git repository and script_version may need to be resolved as a commit.
    • script_version is not an absolute path and so needs to be resolved as a git commit; if --git-dir was not given then this requires cloning a remote repository.
    • are there others?

My thinking is it's best to offer the same features regardless of where the repository was specified (command line argument, job record retrieved from API, or job record provided on command line). If a user (or crunch-dispatch) wants to say crunch-job --git-dir git://example.com/foo.git --job '...' then I say we should oblige.

The exception is that I don't want a crunch-job process running under crunch-dispatch.rb to use a working tree that isn't committed anywhere. Since there's no use case for that, it seems like it could only ever be a security risk to allow "script_version is a local path" behavior when the caller has gone to the trouble of giving us a --git-dir.

I suppose the use cases are
  • Caller provides a {local or remote} repo via {job[repository] or --git-dir} and a {sha1 or something that resolves to a sha1 in that repo}
  • Caller provides a path to a working directory

(Does this answer your question?)

  • L416: this code would allow Job.repository to be a URL of a git repository, not just the name of a Repository resource.
    • If that's only allowed for --git-dir then let's check that argument instead;
    • If it's officially supported to supply a repository URL for Job.repository then let's document it;
    • If it's not officially supported but we're being sneaky here, then let's comment to that effect.

Indeed. This was intentional. I've updated the perldoc.

  • L433: use items_available for a more accurate count (even if we don't bother with paging for $n > 0)

Done. I was only thinking of "more than 1 == bad" but now that you mention it, it would surely be disconcerting to hear that you have only 100 repos named xyzzy when in fact you have 1000 of them.

  • L474: I'm concerned that there are still edge cases where we could get a bad cached commit for $treeish where we should have fetched new commits (e.g. $treeish is a 'deadbeef' branch and there are commits on that branch whose hash begins with 'deadbeef', or $treeish is a branch in the remote repo that has not been fetched to the local one) Let's add a TODO to make a more comprehensive check that $treeish is not a branch or tag and really only resolves to a single commit.

I've increased 3 to 7 and expanded the cryptic comment. Now, it's still not inconceivable that you have a locally cached branch called "deadbee" which resolved to commit "deadbee[...]" last time we fetched but what you really want is a different "deadbee" branch or "deadbee[...]" commit that exists on the remote end.

The second case ($treeish is a branch in the remote repo that has not been fetched) is already accounted for: the only time we skip fetch-and-lookup is when $treeish resolves locally to a commit starting with $treeish.

I think this means we are now more paranoid than git itself (where hex-digit branches are silently preferred over commits of which they are prefixes) and I feel like people who are more paranoid than that should either give a longer abbrev or read the logs. OTOH it's just one more git-fetch per "try-it-now", so perhaps at some point we should have a "--paranoid" flag that really goes the distance and even prevents us from acting like git does here:
  • tom@belle:~/src/arvados (5c1f03e)$ git log -n1 --oneline 5c1f03e
    2453adf 3775: Update comment
    tom@belle:~/src/arvados (5c1f03e)$ git log -n1 --oneline 5c1f03ef
    5c1f03e 3775: Merge branch 'master' into 3775-fetch-git-repo
    

Now at b4c27ed (oops) 7a4f566

#12 Updated by Tim Pierce almost 5 years ago

Tom Clegg wrote:

Tim Pierce wrote:

  • L396: I think it's okay to assume that $Job->{repository} exists because the Job.create method in the controller demands it, but it's a little confusing that the model doesn't enforce a NOT NULL constraint. I think it's worth adding a story to extend the model to enforce this and other constraints.

We don't want to over-validate the job records for jobs that are run locally (if that's what crunch-job is in fact working with, we might as well keep a record of it) but yes, in order for a job to be Queued it should have a repository attribute.

I'm not sure I understand what you mean by over-validating. This is an API server issue and doesn't have much to do with crunch-job specifically; Job.create already demands that a new job provide repository, script, script_version and script_parameters. Why don't we just move those to the model where they're more transparent to the programmer and and help guarantee database consistency?

  • At L400: I think the use cases here are:
    • crunch-job was called with --git-dir, which means that we're working from a git repository and script_version may need to be resolved as a commit.
    • script_version is not an absolute path and so needs to be resolved as a git commit; if --git-dir was not given then this requires cloning a remote repository.
    • are there others?

...
(Does this answer your question?)

I think so. Mainly I want to satisfy myself that we're not missing any weird outlying situations where crunch-job might need to figure out the repository location some other way. I'm not sure I've satisfied myself, but I can't think of any cases we've missed :-)

  • L474: I'm concerned that there are still edge cases where we could get a bad cached commit for $treeish where we should have fetched new commits (e.g. $treeish is a 'deadbeef' branch and there are commits on that branch whose hash begins with 'deadbeef', or $treeish is a branch in the remote repo that has not been fetched to the local one) Let's add a TODO to make a more comprehensive check that $treeish is not a branch or tag and really only resolves to a single commit.

I've increased 3 to 7 and expanded the cryptic comment. Now, it's still not inconceivable that you have a locally cached branch called "deadbee" which resolved to commit "deadbee[...]" last time we fetched but what you really want is a different "deadbee" branch or "deadbee[...]" commit that exists on the remote end.

That's the kind of thing I'm concerned about. The caller expects their script_version specification to be evaluated against the state of the source repository, but crunch-job wants to be efficient and not fetch the source repo if it doesn't have to. Ideally crunch-job will guarantee that your script_version will be interpreted exactly the way it would against the origin repo, just without pulling if it doesn't have to.

Is there a way to run something like git rev-parse to evaluate a ref against a remote repository, so we could see what the origin thinks "deadbee" means and compare it to what the local repo thinks it means, and fetch new refs only if they differ? That sounds like it would be the definitive check I'm looking for, but my git fu is not yet strong enough.

#13 Updated by Tom Clegg almost 5 years ago

Tim Pierce wrote:

Tom Clegg wrote:

Tim Pierce wrote:

  • L396: I think it's okay to assume that $Job->{repository} exists because the Job.create method in the controller demands it, but it's a little confusing that the model doesn't enforce a NOT NULL constraint. I think it's worth adding a story to extend the model to enforce this and other constraints.

We don't want to over-validate the job records for jobs that are run locally (if that's what crunch-job is in fact working with, we might as well keep a record of it) but yes, in order for a job to be Queued it should have a repository attribute.

I'm not sure I understand what you mean by over-validating. This is an API server issue and doesn't have much to do with crunch-job specifically; Job.create already demands that a new job provide repository, script, script_version and script_parameters. Why don't we just move those to the model where they're more transparent to the programmer and and help guarantee database consistency?

For jobs that are run locally, no amount of API server validation can affect what crunch-job does, only what it reports. Since "local" use cases include debugging and development, and aren't expected to be reproducible, it would be awkward for API server to demand that crunch-job supply (for example) a readable repository and extant script_version in cases where crunch-job is not using a repository at all.

I agree, the create action is the wrong place for that validation. It's trivial to bypass it because it hasn't been replicated in the update action, for example. I suspect it got added there just so the "find_or_create" code could assume everything is present. The model does have some validations and they get skipped when state==Running for the above reason. ("If this is what's happening, well, it's too late for us to complain about it now!")

I've increased 3 to 7 and expanded the cryptic comment. Now, it's still not inconceivable that you have a locally cached branch called "deadbee" which resolved to commit "deadbee[...]" last time we fetched but what you really want is a different "deadbee" branch or "deadbee[...]" commit that exists on the remote end.

That's the kind of thing I'm concerned about. The caller expects their script_version specification to be evaluated against the state of the source repository, but crunch-job wants to be efficient and not fetch the source repo if it doesn't have to. Ideally crunch-job will guarantee that your script_version will be interpreted exactly the way it would against the origin repo, just without pulling if it doesn't have to.

Is there a way to run something like git rev-parse to evaluate a ref against a remote repository, so we could see what the origin thinks "deadbee" means and compare it to what the local repo thinks it means, and fetch new refs only if they differ? That sounds like it would be the definitive check I'm looking for, but my git fu is not yet strong enough.

The closest I've found is git ls-remote {giturl} master but it only works for plain tags and branches: you can't look up master^ or a commit hash that way.

Full paranoia means [a] always fetch -- even a 40-character sha1 is not unambiguous because it might be used as a branch name on the remote (and point to a different commit than the commit with that sha1) -- and [b] detect the condition "git would allow this but around here you'll have to delete or rename the deadbee[...] branch in order to select either that branch or the commit with that sha-1".

Perhaps we should implement "full paranoia" mode when we implement this fetch-remote-repo plan on the server side.

Meanwhile do you think the changes at 7a4f566 are good to merge?

#14 Updated by Tom Clegg almost 5 years ago

  • Description updated (diff)

#15 Updated by Tim Pierce almost 5 years ago

Tom Clegg wrote:

Tim Pierce wrote:

Tom Clegg wrote:

Tim Pierce wrote:

  • L396: I think it's okay to assume that $Job->{repository} exists because the Job.create method in the controller demands it, but it's a little confusing that the model doesn't enforce a NOT NULL constraint. I think it's worth adding a story to extend the model to enforce this and other constraints.

We don't want to over-validate the job records for jobs that are run locally (if that's what crunch-job is in fact working with, we might as well keep a record of it) but yes, in order for a job to be Queued it should have a repository attribute.

I'm not sure I understand what you mean by over-validating. This is an API server issue and doesn't have much to do with crunch-job specifically; Job.create already demands that a new job provide repository, script, script_version and script_parameters. Why don't we just move those to the model where they're more transparent to the programmer and and help guarantee database consistency?

For jobs that are run locally, no amount of API server validation can affect what crunch-job does, only what it reports. Since "local" use cases include debugging and development, and aren't expected to be reproducible, it would be awkward for API server to demand that crunch-job supply (for example) a readable repository and extant script_version in cases where crunch-job is not using a repository at all.

All I'm focusing on is: the API server requires that these fields not be NULL, so we might as well express that in the database model where it's supposed to be.

If it is acceptable for these fields to be NULL, then of course we shouldn't make that change, but then we should also lift the restriction from Job.create.

Perhaps we should implement "full paranoia" mode when we implement this fetch-remote-repo plan on the server side.

Meanwhile do you think the changes at 7a4f566 are good to merge?

Sounds good.

Thanks!

Also available in: Atom PDF