Bug #4269
closed[API] Disallow collection UUIDs in script_parameters, only allow portable data hash
Description
One of the unintended side effects of #3036 is that the Collection chooser dialog has been filling in collection record uuids instead of portable data hashes. As a result, most jobs since then contain a uuid instead of a hash in the script_parameters field. This works because arv-mount permits uuids in addition to hashes in --by-id mode (perhaps it shouldn't have). This breaks reproducibility because collection records can be modified after the fact. The science team has already experienced jobs being inappropriately reused due to this problem. Recommend:
- Add a migration that searches past jobs and replaces any instances of collection uuids with the appropriate portable data hash for the collection.
- Add a job model validation that prevents uuids from showing up in script_parameters
- Update legacy code that accepts or tolerates collection UUIDs in script_parameters:
- arv-copy
- (other?)
This is blocked by #4015
Updated by Ward Vandewege about 10 years ago
- Target version changed from Bug Triage to Arvados Future Sprints
Updated by Tim Pierce about 10 years ago
- Target version changed from Arvados Future Sprints to 2014-11-19 sprint
Updated by Tim Pierce about 10 years ago
- Target version changed from 2014-11-19 sprint to 2014-12-10 sprint
Updated by Tim Pierce about 10 years ago
- Status changed from New to In Progress
Updated by Radhika Chippada about 10 years ago
Review feedback:
- Can you call "no_collection_uuids" as "ensure_no_collection_uuids_in_script_params" or something like that?
- May be "return true if recursive_hash_search v, pattern" can be "return recursive_hash_search v, pattern" (same comment for Array)
- Do we want to log these occurrences?
- I like "thing". May be "thing1" and "thing2" even better than "thing" and "pattern" :). Just kidding.
- I am a little concerned about looking for uuid pattern matching in script_parameters. Not sure if I can have a plain text or something like that (like a description) in there where I might enter a collection uuid? Just wondering if we just want to look for collection uuid pattern in script_parameter -> input only? Ignore this question, if it is not meaningful concern for this context.
- Would it make sense to have a fixture with two script parameters (like previous_job_run_with_arvados_sdk_version) and then update the second value (but not changing the uuid) and verify that the update fails? Of course, you said there will be migration coming up soon. So, I am not sure if this will have much value in the long run.
ex: script_parameters: input: xxxxx-xxxxx-blah an_integer: "1"
- It seems like it would be desirable that the uuid pattern matching is a reusable method? I see two places where uuid pattern matching is implemented in the api server code.
- It appears that the test "job validation succeeds when no collection uuid in script_parameters" is covered by other tests in this file (those that use the job_attrs)?
Updated by Tim Pierce about 10 years ago
Radhika Chippada wrote:
Review feedback:
- Can you call "no_collection_uuids" as "ensure_no_collection_uuids_in_script_params" or something like that?
Sure, that's good. Added at 050277d.
- May be "return true if recursive_hash_search v, pattern" can be "return recursive_hash_search v, pattern" (same comment for Array)
No, the recursive hash search shouldn't return right away just because it failed to find the pattern in a single parameter. If that happens, it needs to search the rest of the parameters it finds. This function only returns false if none of the fields in the hash or array it's searching contain the desired pattern.
- Do we want to log these occurrences?
I don't think it's necessary -- these errors should become visible to the user immediately and be traceable. But I'm open to adding logging if we think it would be valuable.
- I like "thing". May be "thing1" and "thing2" even better than "thing" and "pattern" :). Just kidding.
- I am a little concerned about looking for uuid pattern matching in script_parameters. Not sure if I can have a plain text or something like that (like a description) in there where I might enter a collection uuid? Just wondering if we just want to look for collection uuid pattern in script_parameter -> input only? Ignore this question, if it is not meaningful concern for this context.
It's a reasonable concern. I don't like it very much either, but it is consistent with the matching that we do elsewhere for collection uuids, e.g. arv-copy
, which searches the full text of the entire pipeline instance or template for any field that contains a collection uuid anywhere.
I'll ask Tom to confirm whether or not this is a concern.
- Would it make sense to have a fixture with two script parameters (like previous_job_run_with_arvados_sdk_version) and then update the second value (but not changing the uuid) and verify that the update fails? Of course, you said there will be migration coming up soon. So, I am not sure if this will have much value in the long run.
Right -- I don't think there's a lot of value in testing that particular case, since our goal when all is said and done to have a database that's free of inappropriate collection uuids and prevents new ones from being added.
- It seems like it would be desirable that the uuid pattern matching is a reusable method? I see two places where uuid pattern matching is implemented in the api server code.
Yes, at the very least it would help to define a uuid regex in one place and use it consistently throughout. Added ArvadosModel.uuid_regex
at 050277d and replaced hardcoded uuid regexes throughout the code base.
- It appears that the test "job validation succeeds when no collection uuid in script_parameters" is covered by other tests in this file (those that use the job_attrs)?
Yes, that's implicitly satisfied by the other tests. I didn't think it was necessary to add another test that explicitly addresses this case.
Updated by Radhika Chippada about 10 years ago
Tim:
Agree with your responses. And, thanks for updating the uuid regex portion. It will be quite helpful in the long run.
Regarding test "job validation succeeds when no collection uuid in script_parameters", I meant this test is covered by other tests in the file and hence is not needed. However, if you prefer to leave it in, that is fine.
LGTM
Updated by Tim Pierce about 10 years ago
- Status changed from In Progress to Resolved