Feature #3775
closed[Crunch] arv-crunch-job should work in local mode with an arvados-hosted repository
Description
Currently, in local mode, repository="arvados" only works if ./arvados/.git happens to exist when crunch-job runs.
New behavior:- You can use an arbitrary git URL in your job's "repository" attribute.
- You can use the name of an Arvados-hosted repository. The named repo's fetch-url will be used. (This is likely to require SSH agent forwarding since the fetch-url is typically given as an SSH path.)
- You can specify a local filesystem path starting with
/
or./
or../
. - Compatibility issue: When running jobs in your terminal,
"repository":"foo"
in a job record used to mean "use./foo
or./foo/.git
". Now it means "use hosted repository called foo" just like it would in a job submitted to the API server. To get the old behavior, you can now use"repository":"./foo"
.
- You can override the job record's "repository" attribute by passing a different URL/name to crunch-job as a
--git-dir
argument.
- Remote git URLs still aren't supported when submitting jobs to the API server to run on worker nodes: there, you can (still) only use the name of a hosted repository. New features only work when using
arv pipeline run --run-here
orarv-crunch-job
.
Related issues
Updated by Tom Clegg about 10 years ago
- Subject changed from Make arv-crunch-job work in local mode with an arvados-hosted repository to arv-crunch-job should work in local mode with an arvados-hosted repository
- Description updated (diff)
- Category set to Crunch
- Parent task deleted (
#3550)
Updated by Tom Clegg about 10 years ago
- Target version changed from 2014-09-17 sprint to Arvados Future Sprints
Updated by Tom Clegg about 10 years ago
- Subject changed from arv-crunch-job should work in local mode with an arvados-hosted repository to [Crunch] arv-crunch-job should work in local mode with an arvados-hosted repository
Updated by Tom Clegg about 10 years ago
- Target version changed from Arvados Future Sprints to 2014-10-08 sprint
Updated by Tom Clegg about 10 years ago
From c1a82be commit message:
3775: Run local/dev jobs on code from arbitrary remote git repositories.
- repository is a local path X -> use local repository X or X/.git.
- repository is a git url (git://, https://, git@host:repo.git, etc)
fetch branches, tags, and objects from the remote and use the result
to resolve script_version to a commit sha1 and run the job.
- else -> look up the named repository in Arvados and use its
fetch_url as a remote git url as above.
- --git-dir is given on the command line
> use that instead of>{repository} in the above logic. (This isn't new. It ensures
$Job
we don't go off fetching arbitrary remotes when crunch-dispatch.rb
has already pulled the code into its own internal git repo.)
Incidental changes:
- Lose support for looking up subversion revision numbers using
git-svn tags.
- Lose support for CRUNCH_DEFAULT_GIT_DIR environment variable. Pass
--git-dir instead.
- Improve log messages during checkout/install phase.
Updated by Tim Pierce about 10 years ago
Review at 5c1f03ef:
Time is short, so there's only one functional change I would like to see right away:- L416: this regex should be
m{://|^[^/]*\
@.*:}
to correctly match a remote git repository according to https://www.kernel.org/pub/software/scm/git/docs/git-clone.html#URLS ("This syntax is only recognized if there are no slashes before the first colon. This helps differentiate a local path that contains a colon.")
Given that, the code LGTM and I'm okay to merge.
Here's some other stuff that's mostly about clarifying the logic for crunch-job newbies, or longer term improvements (and even if we are ready to dive into rewriting crunch-job, documenting outstanding issues will be important).
- Suggestion: have
cleanup()
first performreturn unless $Job
. Then there's no need to worry about calling it prematurely elsewhere. - As a matter of documentation can we update the comment at L385, to describe more precisely what's happening, maybe this?
# If script_version describes an absolute directory path, and we # were not invoked with a --git-dir argument (i.e. not running under # crunch-dispatch) then assume that the script_version signifies a # file that is available locally, and we should not attempt to resolve # it as a git commit.
- Add comment after L392 to state explicitly that we need to resolve
script_version
as a git commit, possibly from a remote repo. - L396: I think it's okay to assume that
$Job->{repository}
exists because theJob.create
method in the controller demands it, but it's a little confusing that the model doesn't enforce a NOT NULL constraint. I think it's worth adding a story to extend the model to enforce this and other constraints. - At L400: I think the use cases here are:
- crunch-job was called with
--git-dir
, which means that we're working from a git repository andscript_version
may need to be resolved as a commit. script_version
is not an absolute path and so needs to be resolved as a git commit; if--git-dir
was not given then this requires cloning a remote repository.- are there others?
- crunch-job was called with
- L416: this code would allow
Job.repository
to be a URL of a git repository, not just the name of aRepository
resource.- If that's only allowed for
--git-dir
then let's check that argument instead; - If it's officially supported to supply a repository URL for
Job.repository
then let's document it; - If it's not officially supported but we're being sneaky here, then let's comment to that effect.
- If that's only allowed for
- L433: use
items_available
for a more accurate count (even if we don't bother with paging for $n > 0) - L474: I'm concerned that there are still edge cases where we could get a bad cached commit for $treeish where we should have fetched new commits (e.g. $treeish is a 'deadbeef' branch and there are commits on that branch whose hash begins with 'deadbeef', or $treeish is a branch in the remote repo that has not been fetched to the local one) Let's add a TODO to make a more comprehensive check that $treeish is not a branch or tag and really only resolves to a single commit.
Updated by Anonymous about 10 years ago
- Status changed from In Progress to Resolved
- % Done changed from 60 to 100
Applied in changeset arvados|commit:38cc5c0a51657c6b60f3d3f32c566845988dfb6b.
Updated by Tom Clegg about 10 years ago
Tim Pierce wrote:
- L416: this regex should be
m{://|^[^/]*\
@.*:}
to correctly match a remote git repository according to https://www.kernel.org/pub/software/scm/git/docs/git-clone.html#URLS ("This syntax is only recognized if there are no slashes before the first colon. This helps differentiate a local path that contains a colon.")
Done @ dd2a958 (merged).
- Suggestion: have
cleanup()
first performreturn unless $Job
. Then there's no need to worry about calling it prematurely elsewhere.
Done (also done to log_writer_is_active()
-before-save_meta()
, since it was right next door).
- As a matter of documentation can we update the comment at L385, to describe more precisely what's happening, maybe this?
[...]
Updated.
- Add comment after L392 to state explicitly that we need to resolve
script_version
as a git commit, possibly from a remote repo.
Added.
- L396: I think it's okay to assume that
$Job->{repository}
exists because theJob.create
method in the controller demands it, but it's a little confusing that the model doesn't enforce a NOT NULL constraint. I think it's worth adding a story to extend the model to enforce this and other constraints.
We don't want to over-validate the job records for jobs that are run locally (if that's what crunch-job is in fact working with, we might as well keep a record of it) but yes, in order for a job to be Queued it should have a repository attribute.
- At L400: I think the use cases here are:
- crunch-job was called with
--git-dir
, which means that we're working from a git repository andscript_version
may need to be resolved as a commit.script_version
is not an absolute path and so needs to be resolved as a git commit; if--git-dir
was not given then this requires cloning a remote repository.- are there others?
My thinking is it's best to offer the same features regardless of where the repository was specified (command line argument, job record retrieved from API, or job record provided on command line). If a user (or crunch-dispatch) wants to say crunch-job --git-dir git://example.com/foo.git --job '...'
then I say we should oblige.
The exception is that I don't want a crunch-job process running under crunch-dispatch.rb to use a working tree that isn't committed anywhere. Since there's no use case for that, it seems like it could only ever be a security risk to allow "script_version is a local path" behavior when the caller has gone to the trouble of giving us a --git-dir
.
- Caller provides a {local or remote} repo via {job[repository] or --git-dir} and a {sha1 or something that resolves to a sha1 in that repo}
- Caller provides a path to a working directory
(Does this answer your question?)
- L416: this code would allow
Job.repository
to be a URL of a git repository, not just the name of aRepository
resource.
- If that's only allowed for
--git-dir
then let's check that argument instead;- If it's officially supported to supply a repository URL for
Job.repository
then let's document it;- If it's not officially supported but we're being sneaky here, then let's comment to that effect.
Indeed. This was intentional. I've updated the perldoc.
- L433: use
items_available
for a more accurate count (even if we don't bother with paging for $n > 0)
Done. I was only thinking of "more than 1 == bad" but now that you mention it, it would surely be disconcerting to hear that you have only 100 repos named xyzzy when in fact you have 1000 of them.
- L474: I'm concerned that there are still edge cases where we could get a bad cached commit for $treeish where we should have fetched new commits (e.g. $treeish is a 'deadbeef' branch and there are commits on that branch whose hash begins with 'deadbeef', or $treeish is a branch in the remote repo that has not been fetched to the local one) Let's add a TODO to make a more comprehensive check that $treeish is not a branch or tag and really only resolves to a single commit.
I've increased 3 to 7 and expanded the cryptic comment. Now, it's still not inconceivable that you have a locally cached branch called "deadbee" which resolved to commit "deadbee[...]" last time we fetched but what you really want is a different "deadbee" branch or "deadbee[...]" commit that exists on the remote end.
The second case ($treeish is a branch in the remote repo that has not been fetched) is already accounted for: the only time we skip fetch-and-lookup is when $treeish resolves locally to a commit starting with $treeish.
I think this means we are now more paranoid than git itself (where hex-digit branches are silently preferred over commits of which they are prefixes) and I feel like people who are more paranoid than that should either give a longer abbrev or read the logs. OTOH it's just one more git-fetch per "try-it-now", so perhaps at some point we should have a "--paranoid" flag that really goes the distance and even prevents us from acting like git does here:tom@belle:~/src/arvados (5c1f03e)$ git log -n1 --oneline 5c1f03e 2453adf 3775: Update comment tom@belle:~/src/arvados (5c1f03e)$ git log -n1 --oneline 5c1f03ef 5c1f03e 3775: Merge branch 'master' into 3775-fetch-git-repo
Updated by Tim Pierce about 10 years ago
Tom Clegg wrote:
Tim Pierce wrote:
- L396: I think it's okay to assume that
$Job->{repository}
exists because theJob.create
method in the controller demands it, but it's a little confusing that the model doesn't enforce a NOT NULL constraint. I think it's worth adding a story to extend the model to enforce this and other constraints.We don't want to over-validate the job records for jobs that are run locally (if that's what crunch-job is in fact working with, we might as well keep a record of it) but yes, in order for a job to be Queued it should have a repository attribute.
I'm not sure I understand what you mean by over-validating. This is an API server issue and doesn't have much to do with crunch-job specifically; Job.create
already demands that a new job provide repository, script, script_version and script_parameters. Why don't we just move those to the model where they're more transparent to the programmer and and help guarantee database consistency?
- At L400: I think the use cases here are:
- crunch-job was called with
--git-dir
, which means that we're working from a git repository andscript_version
may need to be resolved as a commit.script_version
is not an absolute path and so needs to be resolved as a git commit; if--git-dir
was not given then this requires cloning a remote repository.- are there others?
...
(Does this answer your question?)
I think so. Mainly I want to satisfy myself that we're not missing any weird outlying situations where crunch-job might need to figure out the repository location some other way. I'm not sure I've satisfied myself, but I can't think of any cases we've missed :-)
- L474: I'm concerned that there are still edge cases where we could get a bad cached commit for $treeish where we should have fetched new commits (e.g. $treeish is a 'deadbeef' branch and there are commits on that branch whose hash begins with 'deadbeef', or $treeish is a branch in the remote repo that has not been fetched to the local one) Let's add a TODO to make a more comprehensive check that $treeish is not a branch or tag and really only resolves to a single commit.
I've increased 3 to 7 and expanded the cryptic comment. Now, it's still not inconceivable that you have a locally cached branch called "deadbee" which resolved to commit "deadbee[...]" last time we fetched but what you really want is a different "deadbee" branch or "deadbee[...]" commit that exists on the remote end.
That's the kind of thing I'm concerned about. The caller expects their script_version specification to be evaluated against the state of the source repository, but crunch-job wants to be efficient and not fetch the source repo if it doesn't have to. Ideally crunch-job will guarantee that your script_version will be interpreted exactly the way it would against the origin repo, just without pulling if it doesn't have to.
Is there a way to run something like git rev-parse
to evaluate a ref against a remote repository, so we could see what the origin thinks "deadbee" means and compare it to what the local repo thinks it means, and fetch new refs only if they differ? That sounds like it would be the definitive check I'm looking for, but my git fu is not yet strong enough.
Updated by Tom Clegg about 10 years ago
Tim Pierce wrote:
Tom Clegg wrote:
Tim Pierce wrote:
- L396: I think it's okay to assume that
$Job->{repository}
exists because theJob.create
method in the controller demands it, but it's a little confusing that the model doesn't enforce a NOT NULL constraint. I think it's worth adding a story to extend the model to enforce this and other constraints.We don't want to over-validate the job records for jobs that are run locally (if that's what crunch-job is in fact working with, we might as well keep a record of it) but yes, in order for a job to be Queued it should have a repository attribute.
I'm not sure I understand what you mean by over-validating. This is an API server issue and doesn't have much to do with crunch-job specifically;
Job.create
already demands that a new job provide repository, script, script_version and script_parameters. Why don't we just move those to the model where they're more transparent to the programmer and and help guarantee database consistency?
For jobs that are run locally, no amount of API server validation can affect what crunch-job does, only what it reports. Since "local" use cases include debugging and development, and aren't expected to be reproducible, it would be awkward for API server to demand that crunch-job supply (for example) a readable repository and extant script_version in cases where crunch-job is not using a repository at all.
I agree, the create
action is the wrong place for that validation. It's trivial to bypass it because it hasn't been replicated in the update
action, for example. I suspect it got added there just so the "find_or_create" code could assume everything is present. The model does have some validations and they get skipped when state==Running
for the above reason. ("If this is what's happening, well, it's too late for us to complain about it now!")
I've increased 3 to 7 and expanded the cryptic comment. Now, it's still not inconceivable that you have a locally cached branch called "deadbee" which resolved to commit "deadbee[...]" last time we fetched but what you really want is a different "deadbee" branch or "deadbee[...]" commit that exists on the remote end.
That's the kind of thing I'm concerned about. The caller expects their script_version specification to be evaluated against the state of the source repository, but crunch-job wants to be efficient and not fetch the source repo if it doesn't have to. Ideally crunch-job will guarantee that your script_version will be interpreted exactly the way it would against the origin repo, just without pulling if it doesn't have to.
Is there a way to run something like
git rev-parse
to evaluate a ref against a remote repository, so we could see what the origin thinks "deadbee" means and compare it to what the local repo thinks it means, and fetch new refs only if they differ? That sounds like it would be the definitive check I'm looking for, but my git fu is not yet strong enough.
The closest I've found is git ls-remote {giturl} master
but it only works for plain tags and branches: you can't look up master^
or a commit hash that way.
Full paranoia means [a] always fetch -- even a 40-character sha1 is not unambiguous because it might be used as a branch name on the remote (and point to a different commit than the commit with that sha1) -- and [b] detect the condition "git would allow this but around here you'll have to delete or rename the deadbee[...] branch in order to select either that branch or the commit with that sha-1".
Perhaps we should implement "full paranoia" mode when we implement this fetch-remote-repo plan on the server side.
Meanwhile do you think the changes at 7a4f566 are good to merge?
Updated by Tim Pierce about 10 years ago
Tom Clegg wrote:
Tim Pierce wrote:
Tom Clegg wrote:
Tim Pierce wrote:
- L396: I think it's okay to assume that
$Job->{repository}
exists because theJob.create
method in the controller demands it, but it's a little confusing that the model doesn't enforce a NOT NULL constraint. I think it's worth adding a story to extend the model to enforce this and other constraints.We don't want to over-validate the job records for jobs that are run locally (if that's what crunch-job is in fact working with, we might as well keep a record of it) but yes, in order for a job to be Queued it should have a repository attribute.
I'm not sure I understand what you mean by over-validating. This is an API server issue and doesn't have much to do with crunch-job specifically;
Job.create
already demands that a new job provide repository, script, script_version and script_parameters. Why don't we just move those to the model where they're more transparent to the programmer and and help guarantee database consistency?For jobs that are run locally, no amount of API server validation can affect what crunch-job does, only what it reports. Since "local" use cases include debugging and development, and aren't expected to be reproducible, it would be awkward for API server to demand that crunch-job supply (for example) a readable repository and extant script_version in cases where crunch-job is not using a repository at all.
All I'm focusing on is: the API server requires that these fields not be NULL, so we might as well express that in the database model where it's supposed to be.
If it is acceptable for these fields to be NULL, then of course we shouldn't make that change, but then we should also lift the restriction from Job.create
.
Perhaps we should implement "full paranoia" mode when we implement this fetch-remote-repo plan on the server side.
Meanwhile do you think the changes at 7a4f566 are good to merge?
Sounds good.
Thanks!