Bug #4269

[API] Disallow collection UUIDs in script_parameters, only allow portable data hash

Added by Peter Amstutz almost 5 years ago. Updated almost 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Tim Pierce
Category:
-
Target version:
Start date:
12/04/2014
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0

Description

One of the unintended side effects of #3036 is that the Collection chooser dialog has been filling in collection record uuids instead of portable data hashes. As a result, most jobs since then contain a uuid instead of a hash in the script_parameters field. This works because arv-mount permits uuids in addition to hashes in --by-id mode (perhaps it shouldn't have). This breaks reproducibility because collection records can be modified after the fact. The science team has already experienced jobs being inappropriately reused due to this problem. Recommend:

  1. Add a migration that searches past jobs and replaces any instances of collection uuids with the appropriate portable data hash for the collection.
  2. Add a job model validation that prevents uuids from showing up in script_parameters
  3. Update legacy code that accepts or tolerates collection UUIDs in script_parameters:
    • arv-copy
    • (other?)

This is blocked by #4015


Subtasks

Task #4725: Review 4269-no-collection-uuid-in-script-paramsResolvedRadhika Chippada

Task #4397: add validationResolvedTim Pierce


Related issues

Related to Arvados - Bug #4756: [API] Add migration to change collection uuids to portable_data_hash in old job script_parametersRejected12/09/2014

Blocked by Arvados - Bug #4015: [Workbench] Collection chooser should be filling in portable_data_hash as well as collection uuid when used to pick pipeline inputs.Resolved10/21/2014

Associated revisions

Revision b09686ba
Added by Tim Pierce almost 5 years ago

Merge branch '4269-no-collection-uuid-in-script-params'

Refs #4269.

History

#1 Updated by Peter Amstutz almost 5 years ago

  • Description updated (diff)

#2 Updated by Peter Amstutz almost 5 years ago

  • Description updated (diff)

#3 Updated by Peter Amstutz almost 5 years ago

  • Description updated (diff)

#4 Updated by Tim Pierce almost 5 years ago

  • Description updated (diff)

#5 Updated by Ward Vandewege almost 5 years ago

  • Target version changed from Bug Triage to Arvados Future Sprints

#6 Updated by Ward Vandewege almost 5 years ago

  • Story points set to 1.0

#7 Updated by Tim Pierce almost 5 years ago

  • Target version changed from Arvados Future Sprints to 2014-11-19 sprint

#8 Updated by Tim Pierce almost 5 years ago

  • Assigned To set to Tim Pierce

#9 Updated by Tim Pierce almost 5 years ago

  • Target version changed from 2014-11-19 sprint to 2014-12-10 sprint

#10 Updated by Tim Pierce almost 5 years ago

  • Status changed from New to In Progress

#11 Updated by Radhika Chippada almost 5 years ago

Review feedback:

  • Can you call "no_collection_uuids" as "ensure_no_collection_uuids_in_script_params" or something like that?
  • May be "return true if recursive_hash_search v, pattern" can be "return recursive_hash_search v, pattern" (same comment for Array)
  • Do we want to log these occurrences?
  • I like "thing". May be "thing1" and "thing2" even better than "thing" and "pattern" :). Just kidding.
  • I am a little concerned about looking for uuid pattern matching in script_parameters. Not sure if I can have a plain text or something like that (like a description) in there where I might enter a collection uuid? Just wondering if we just want to look for collection uuid pattern in script_parameter -> input only? Ignore this question, if it is not meaningful concern for this context.
  • Would it make sense to have a fixture with two script parameters (like previous_job_run_with_arvados_sdk_version) and then update the second value (but not changing the uuid) and verify that the update fails? Of course, you said there will be migration coming up soon. So, I am not sure if this will have much value in the long run.
ex: script_parameters:
    input: xxxxx-xxxxx-blah
    an_integer: "1" 
  • It seems like it would be desirable that the uuid pattern matching is a reusable method? I see two places where uuid pattern matching is implemented in the api server code.
  • It appears that the test "job validation succeeds when no collection uuid in script_parameters" is covered by other tests in this file (those that use the job_attrs)?

#12 Updated by Tim Pierce almost 5 years ago

Radhika Chippada wrote:

Review feedback:

  • Can you call "no_collection_uuids" as "ensure_no_collection_uuids_in_script_params" or something like that?

Sure, that's good. Added at 050277d.

  • May be "return true if recursive_hash_search v, pattern" can be "return recursive_hash_search v, pattern" (same comment for Array)

No, the recursive hash search shouldn't return right away just because it failed to find the pattern in a single parameter. If that happens, it needs to search the rest of the parameters it finds. This function only returns false if none of the fields in the hash or array it's searching contain the desired pattern.

  • Do we want to log these occurrences?

I don't think it's necessary -- these errors should become visible to the user immediately and be traceable. But I'm open to adding logging if we think it would be valuable.

  • I like "thing". May be "thing1" and "thing2" even better than "thing" and "pattern" :). Just kidding.
  • I am a little concerned about looking for uuid pattern matching in script_parameters. Not sure if I can have a plain text or something like that (like a description) in there where I might enter a collection uuid? Just wondering if we just want to look for collection uuid pattern in script_parameter -> input only? Ignore this question, if it is not meaningful concern for this context.

It's a reasonable concern. I don't like it very much either, but it is consistent with the matching that we do elsewhere for collection uuids, e.g. arv-copy, which searches the full text of the entire pipeline instance or template for any field that contains a collection uuid anywhere.

I'll ask Tom to confirm whether or not this is a concern.

  • Would it make sense to have a fixture with two script parameters (like previous_job_run_with_arvados_sdk_version) and then update the second value (but not changing the uuid) and verify that the update fails? Of course, you said there will be migration coming up soon. So, I am not sure if this will have much value in the long run.

Right -- I don't think there's a lot of value in testing that particular case, since our goal when all is said and done to have a database that's free of inappropriate collection uuids and prevents new ones from being added.

  • It seems like it would be desirable that the uuid pattern matching is a reusable method? I see two places where uuid pattern matching is implemented in the api server code.

Yes, at the very least it would help to define a uuid regex in one place and use it consistently throughout. Added ArvadosModel.uuid_regex at 050277d and replaced hardcoded uuid regexes throughout the code base.

  • It appears that the test "job validation succeeds when no collection uuid in script_parameters" is covered by other tests in this file (those that use the job_attrs)?

Yes, that's implicitly satisfied by the other tests. I didn't think it was necessary to add another test that explicitly addresses this case.

#13 Updated by Radhika Chippada almost 5 years ago

Tim:

Agree with your responses. And, thanks for updating the uuid regex portion. It will be quite helpful in the long run.

Regarding test "job validation succeeds when no collection uuid in script_parameters", I meant this test is covered by other tests in the file and hence is not needed. However, if you prefer to leave it in, that is fine.

LGTM

#14 Updated by Tim Pierce almost 5 years ago

Moved the migration work to #4756.

#15 Updated by Tim Pierce almost 5 years ago

  • Status changed from In Progress to Resolved

Also available in: Atom PDF