Story #3410

[API] Rename collection "desired replication" attributes, add model constraints

Added by Misha Zatsman over 5 years ago. Updated almost 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
API
Target version:
Start date:
07/29/2014
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0

Description

Each collection record should have a field corresponding to the level of replication that the owner desires.

It's okay to leave this field blank, in which case the default value will be assumed.

The default replication level for collections should be a systemwide default which the api server can return.

Implementation:
  • Rename redundancy -> replication_desired
  • Rename redundancy_confirmed -> replication_confirmed
  • Rename redundancy_confirmed_at -> replication_confirmed_at
  • Drop redundancy_confirmed_by_client_uuid column
  • Delete redundancy_status method from models/collection.rb
  • Prevent non-privileged users from updating replication_confirmed* columns
  • When a collection's manifest_text changes in a way that introduces new block hashes (that weren't in the old manifest_text), reset redundancy_confirmed* columns to NULL
  • Add a config entry in services/api/config/application.default.yml, default_collection_replication: 2
  • Expose the config entry in the discovery document (as defaultCollectionReplication at top level?)

Subtasks

Task #5100: Write migrationResolvedTom Clegg

Task #5101: Write testsResolvedTom Clegg

Task #5165: Review 3410-replication-attrsResolvedRadhika Chippada


Related issues

Related to Arvados - Bug #5685: [SDKs] In Go SDK keepclient, use defaultCollectionReplication as default Want_replicas (instead of always 2)Resolved

Blocks Arvados - Story #3408: [Keep] Implement Production Data ManagerResolved07/29/2014

Blocks Arvados - Feature #5011: [SDKs] arv-put accepts --replication argument (saved with collection record)Resolved01/30/2015

Blocks Arvados - Task #5124: Use server-provided default replication level as default in arv-put and CollectionWriter classesResolved01/30/2015

Associated revisions

Revision d87717b4
Added by Tom Clegg almost 5 years ago

Merge branch '3410-replication-attrs' closes #3410 refs #5011

History

#1 Updated by Ward Vandewege over 5 years ago

  • Target version set to 2014-08-27 Sprint

#2 Updated by Ward Vandewege over 5 years ago

  • Subject changed from Store desired replication level in collection to [API] Store desired replication level in collection

#3 Updated by Ward Vandewege over 5 years ago

  • Story points set to 1.0

#4 Updated by Peter Amstutz over 5 years ago

  • Assigned To set to Peter Amstutz

#5 Updated by Tim Pierce over 5 years ago

  • Category set to API

#6 Updated by Tom Clegg over 5 years ago

  • Target version changed from 2014-08-27 Sprint to Arvados Future Sprints

#7 Updated by Tom Clegg over 5 years ago

  • Subject changed from [API] Store desired replication level in collection to [API] Store desired replication level in collections table
  • Description updated (diff)
  • Assigned To deleted (Peter Amstutz)

#8 Updated by Ward Vandewege about 5 years ago

  • Target version changed from Arvados Future Sprints to 2015-02-18 sprint

#9 Updated by Peter Amstutz almost 5 years ago

  • Assigned To set to Peter Amstutz

#10 Updated by Peter Amstutz almost 5 years ago

What's the difference between redundancy_confirmed and redundancy_confirmed_at ? If redundancy_confirmed is just a boolean, then only redundancy_confirmed_at is needed?

#11 Updated by Tom Clegg almost 5 years ago

  • Subject changed from [API] Store desired replication level in collections table to [API] Rename collection "desired replication" attributes, add model constraints

#12 Updated by Tom Clegg almost 5 years ago

Peter Amstutz wrote:

What's the difference between redundancy_confirmed and redundancy_confirmed_at ? If redundancy_confirmed is just a boolean, then only redundancy_confirmed_at is needed?

replication_confirmed is the number of copies.

#13 Updated by Tom Clegg almost 5 years ago

  • Assigned To changed from Peter Amstutz to Tom Clegg

#14 Updated by Radhika Chippada almost 5 years ago

  • I am understanding this implementation as follows. For each collection, replication_desired can be set (by admin); or the default will be used. Whatever is this value (default or per collection), we will initiate the replication process when the replication_confirmed != replication_desired. My comments are based on this understanding. Please correct if this is not the case.
  • I think the name “replication_current” might have been clearer over “replication_confirmed”.
  • when replication_desired is changed, do we need to reset replication_confirmed and replication_confirmed_at? I am thinking this can be set by admin after the initial collection creation and replication_confirmed etc are completed. Or can be updated after setting to a value previously. In either case, I am assuming we would need to replicate additional copies or remove replications as needed. Does this get initiated because replication_desired!= replication_confirmed or do we need to reset these attributes to initiate the replication process?
  • maybe_clear_redundancy_confirmed - should this be using the word “replication” in place of “redundancy”?
  • ensure_permission_to_save method

Why do we want non-admin users be able to clear replication_confirmed* when they cannot set them? Should the following be:

    if (not current_user.andand.is_admin and
        (replication_confirmed_at_changed? or replication_confirmed_changed?) and
        not (replication_confirmed_at.nil? and replication_confirmed.nil?))

this:

    if (not current_user.andand.is_admin and
        (replication_confirmed_at_changed? or replication_confirmed_changed?))
  • config/application.default.yml
    can we update comment — “Set replication level for collections whose replication_desired attribute is nil” => “Default replication level for collections. This will be used when replication_desired attribute is nil.”
  • migration script -> comment in up method
    “Removing that column dropped the index. Let's put it back” => can you please say “search index” instead of “index”?
  • test "replication_desired reports #{ask or 'default'} if redundancy is #{ask}" — should this be using the word “redundancy” in here? I think it should be "replication_desired reports #{ask or 'default'} if desired is #{ask}”?
  • test "replication_confirmed* cannot be set by non-admin user" do
    Can you update this comment: “# Cannot set both at once.” => “Cannot set both at once either.”
  • test “don't clear replication_confirmed* when just deleting a data block” seems to have done what you intended. Verified as below in the test:
Initial manifest:  . acbd18db4cc2f85cedef654fccc4a4d8+3 37b51d194a7513e45b56f6524f2d51f2+3 0:3:foo 3:6:bar
Updated manifest:  . 37b51d194a7513e45b56f6524f2d51f2+3 0:3:foo
  • In the fixtures, can we make the names more readable? ex:
    • “replication wantnull havenull” => “replication want-null have-null”
    • “replication want2 have2” => “replication want-2 have-2” or “replication want 2 and have 2”

#15 Updated by Tom Clegg almost 5 years ago

Radhika Chippada wrote:

  • I am understanding this implementation as follows. For each collection, replication_desired can be set (by admin); or the default will be used. Whatever is this value (default or per collection), we will initiate the replication process when the replication_confirmed != replication_desired. My comments are based on this understanding. Please correct if this is not the case.

The gist of it is that the user specifies a desired replication level, while the data manager gives feedback about the current replication state as often as practicable, and makes additional copies whenever needed. This way the user can check whether the desired level has been achieved, detect when replication falls (even temporarily) below the desired level, and be assured that the data manager process isn't simply stopped or broken.

Given that storage devices and nodes can fail (and return to service) without notice, an assurance about the current replication level is somewhere between impractical and unbelievable. However, a claim about the most recent verification event is practical and believable.

Likewise, the process of comparing desired and actual states has to be done regularly whether or not anyone changes replication_desired on any collections. Data manager could prioritize cases where the latest confirmed level was lower than the current desired level (or was null), though.

  • I think the name “replication_current” might have been clearer over “replication_confirmed”.

I think "current" implies (incorrectly, see above) that the value represents the state at the time of the collections.get request. Also, "confirmed" highlights the distinction between "specified/desired" (which is always "current") and a confirmation event (which happened in the past). I'm certainly willing to consider other terms ("verified"?), but I do think whatever we choose should match the timestamp field in an obvious way ("replication_current_at" doesn't quite sound right).

  • when replication_desired is changed, do we need to reset replication_confirmed and replication_confirmed_at? I am thinking this can be set by admin after the initial collection creation and replication_confirmed etc are completed. Or can be updated after setting to a value previously. In either case, I am assuming we would need to replicate additional copies or remove replications as needed. Does this get initiated because replication_desired!= replication_confirmed or do we need to reset these attributes to initiate the replication process?

I think if a user increases replication_desired from 2 to 3, the information about when replication was last confirmed (and whether it was confirmed to be 1 or 2) is still correct, and still just as useful.

It is possible (but not certain) that additional copies of data blocks will be made in order to bring actual replication from 2 to 3.

  • maybe_clear_redundancy_confirmed - should this be using the word “replication” in place of “redundancy”?

Ah, yes. Fixed in 08c575d.

  • ensure_permission_to_save method

Why do we want non-admin users be able to clear replication_confirmed* when they cannot set them? Should the following be:

[...]

I figured this restriction would amount to an inconvenience at most: a non-admin user can achieve the same result by adding an arbitrary data block to the manifest (causing the confirmed fields to reset to nil) and then removing it.

  • config/application.default.yml
    can we update comment — “Set replication level for collections whose replication_desired attribute is nil” => “Default replication level for collections. This will be used when replication_desired attribute is nil.”

Done.

  • migration script -> comment in up method
    “Removing that column dropped the index. Let's put it back” => can you please say “search index” instead of “index”?

Done.

  • test "replication_desired reports #{ask or 'default'} if redundancy is #{ask}" — should this be using the word “redundancy” in here? I think it should be "replication_desired reports #{ask or 'default'} if desired is #{ask}”?

Hm... I already did this in d62b733, which adds a bunch of other tests too. Which version were you looking at?

  • test "replication_confirmed* cannot be set by non-admin user" do
    Can you update this comment: “# Cannot set both at once.” => “Cannot set both at once either.”

Done.

  • test “don't clear replication_confirmed* when just deleting a data block” seems to have done what you intended. Verified as below in the test:

[...]

I don't quite follow you here. I agree that it seems to have done what I intended. (Hopefully it's not unique in this respect?)

  • In the fixtures, can we make the names more readable? ex:
    • “replication wantnull havenull” => “replication want-null have-null”
    • “replication want2 have2” => “replication want-2 have-2” or “replication want 2 and have 2”

Changed to want=2, etc.

Now at 1f7a6b5

#16 Updated by Radhika Chippada almost 5 years ago

  • I understand replication strategy better after your first response above. Also agree with your comments about "confirmed" versus "current."
  • I was looking at d302307. And, I missed the sdk updates due to this. A couple more comments now that I looked at them (below).
  • commands/put.py
    • This comment is still a bit confusing to me.
      “args.replication is how many copies we instruct Arvados to maintain (by passing it in collections().create() after all data is written).” Are you saying “args.replication is how many copies we instruct Arvados to maintain after all data is written (by passing it in collections().create()).”?
    • This comment can be a bit more elaborate?
      “If args.replication is given as None, it should stay None, but we still need to write a suitable number of copies to Keep.” Something like “If args.replication is given as None, replication_desired would be none. However, we will write a default number of copies to Keep …” ?
  • Is this still staying in put.py?
    if api_client._schema.schemas['Collection']['properties'].get(replication_attr, None) is None: # API calls it 'redundancy' until #3410.
    replication_attr = 'redundancy'
  • Regarding test “don't clear replication_confirmed* when just deleting a data block”
    You had a comment in the test “We really deleted a block there, right?” I was responding to that comment with my observation that the test is indeed doing what it intends to do.
  • I did notice that you had not yet merged latest master (and full text search). I am glad you merged and took care of updating that index as well.
  • ensure_permission_to_save
    Ah. I see the problem now. If we do what I suggested the scenario "clear replication_confirmed* when introducing a new block in manifest" fails which is not acceptable.
  • I noticed this issue with the db migration script. When we rollback, the redundancy_confirmed_by_client_uuid is added back and is the last column in the table. Hence the search_index needs to list it as the last column.
    [ 24/260] ArvadosModelTest#test_search_index_exists_on_models_that_go_into_projects = 0.05 s                                         
      1) Failure:
    ArvadosModelTest#test_search_index_exists_on_models_that_go_into_projects [/home/radhika/arvados/services/api/test/unit/arvados_model_test.rb:144]:
    collections has no search index with columns ["owner_uuid", "modified_by_client_uuid", "modified_by_user_uuid",
    "portable_data_hash", "uuid", "name", "file_names", "redundancy_confirmed_by_client_uuid"].
    Instead found search index with columns ["owner_uuid", "modified_by_client_uuid", "modified_by_user_uuid",
    "portable_data_hash", "redundancy_confirmed_by_client_uuid", "uuid", "name", "file_names"]
    

#17 Updated by Tom Clegg almost 5 years ago

Radhika Chippada wrote:

“args.replication is how many copies we instruct Arvados to maintain (by passing it in collections().create() after all data is written).” Are you saying “args.replication is how many copies we instruct Arvados to maintain after all data is written (by passing it in collections().create()).”?

Changed to:

    # write_copies diverges from args.replication here.
    # args.replication is how many copies we will instruct Arvados to
    # maintain (by passing it in collections().create()) after all
    # data is written -- and if None was given, we'll use None there.
    # Meanwhile, write_copies is how many copies of each data block we
    # write to Keep, which has to be a number.
    #
    # If we simply changed args.replication from None to a default
    # here, we'd end up erroneously passing the default replication
    # level (instead of None) to collections().create().
  • Is this still staying in put.py?
    if api_client._schema.schemas['Collection']['properties'].get(replication_attr, None) is None:
    # API calls it 'redundancy' until #3410.
    replication_attr = 'redundancy'

Yes, at least for a little while: it allows new Python clients to work against old API servers, e.g., a compute node or shell VM updates to this version before its API server does. I updated the comment from "calls" to "called", though, since this is #3410...

  • Regarding test “don't clear replication_confirmed* when just deleting a data block”
    You had a comment in the test “We really deleted a block there, right?” I was responding to that comment with my observation that the test is indeed doing what it intends to do.

Ah, I see. I meant it as "this is what the next line does" but it can just as well scan as a question to amuse the reader. I've rephrased it:

-      # We really deleted a block there, right?
+
+      # Confirm that we did just remove a block from the manifest (if
+      # not, this test would pass without testing the relevant case):
  • I noticed this issue with the db migration script. When we rollback, the redundancy_confirmed_by_client_uuid is added back and is the last column in the table. Hence the search_index needs to list it as the last column.
    [...]

Indeed! Updated the index. Also updated the test to be less sensitive, since (AFAIK) it doesn't actually matter whether the columns are given in the same order in the table and the index -- only that the set of columns is correct. Right?

Now at 11e1ee (!)

#18 Updated by Tom Clegg almost 5 years ago

  • Status changed from New to In Progress

#19 Updated by Anonymous almost 5 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 67 to 100

Applied in changeset arvados|commit:d87717b4ec885059183ef6d7fa6780c343338455.

Also available in: Atom PDF