Project

General

Profile

Actions

Idea #14484

closed

[API Server] Return collection size and number of files in collection record

Added by Tom Morris about 6 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Eric Biagiotti
Category:
-
Target version:
Start date:
03/28/2019
Due date:
Story points:
1.0
Release relationship:
Auto

Description

Add database columns (file_count, file_size_total?)

Update in a migration (avoid updating too many rows in a single transaction -- see #13752, af1125bd1)

Use sum of listed file sizes (as opposed to sum of sizes of distinct blocks, etc.)


Subtasks 1 (0 open1 closed)

Task #14897: Review 14484-collection-record-updateClosedLucas Di Pentima03/28/2019Actions

Related issues 1 (0 open1 closed)

Related to Arvados - Idea #15093: Work with Ops to decide best DB migration strategy for collection file count & sizeResolvedWard VandewegeActions
Actions #1

Updated by Tom Morris about 6 years ago

  • Status changed from New to In Progress
Actions #2

Updated by Tom Clegg about 6 years ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz almost 6 years ago

  • Status changed from In Progress to New
  • Target version changed from To Be Groomed to Arvados Future Sprints
  • Story points set to 1.0
Actions #4

Updated by Tom Morris almost 6 years ago

  • Target version changed from Arvados Future Sprints to 2019-03-13 Sprint
Actions #5

Updated by Eric Biagiotti almost 6 years ago

  • Assigned To set to Eric Biagiotti
Actions #6

Updated by Tom Morris almost 6 years ago

  • Release set to 15
Actions #7

Updated by Eric Biagiotti almost 6 years ago

  • Target version changed from 2019-03-13 Sprint to 2019-03-27 Sprint
Actions #8

Updated by Eric Biagiotti almost 6 years ago

  • Status changed from New to In Progress
Actions #9

Updated by Eric Biagiotti almost 6 years ago

Approach:

  • Create a migration that will create the columns file_count, file_size_total and populate them with the number of files in the collection (integer DEFAULT 0 NOT NULL) and the sum of the files listed sizes (decimal DEFAULT 0.0 NOT NULL) respectively.
  • Update models/collection.rb to include the new columns
  • When files are added/removed, update the collections file_count and file_size_total columns.
  • Add corresponding INSERT to schema_migrations table in structure.sql
  • Create/Modify tests
  • Update documentation

Questions:

  • During the migration, I can get the files and their sizes from the manifest text. Do I have to parse this manually, or is there an object I should be instantiating?
  • When files are added to a collection, where in the code does the manifest get altered? I'm assuming this is the same place that I want to update the [file_count, file_size_total] columns.
Actions #10

Updated by Lucas Di Pentima almost 6 years ago

Eric Biagiotti wrote:

  • Create a migration that will create the columns file_count, file_size_total and populate them with the number of files in the collection (integer DEFAULT 0 NOT NULL) and the sum of the files listed sizes (decimal DEFAULT 0.0 NOT NULL) respectively.

File size should be in bytes so I suppose we should be using BIGINT instead of DECIMAL?

  • Add corresponding INSERT to schema_migrations table in structure.sql

Check out docs about migrations on rails. The structure.sql file is auto-generated from the migrations. Basically you can write a schema-changing migration first (adding the 2 cols with their default values) and another that does a general update. Then, you run rake db:migrate to apply them and generate the updated structure.sql file.

  • Update models/collection.rb to include the new columns
  • When files are added/removed, update the collections file_count and file_size_total columns.
  • During the migration, I can get the files and their sizes from the manifest text. Do I have to parse this manually, or is there an object I should be instantiating?
  • When files are added to a collection, where in the code does the manifest get altered? I'm assuming this is the same place that I want to update the [file_count, file_size_total] columns.

I think the correct strategy would be to write a callback on collection's create & update operations checking if manifest_text changed. When this happens, you scan the new manifest to get the values. This scan code should be added to the Ruby SDK, if it's not already there. This code can be also used from the data migration.

The API Server doesn't support "add/modify file" operations. Those ops are done client-side and then the new resulting manifest is updated on the API server, so that's why you should pay attention when the manifest text changes.

Actions #11

Updated by Peter Amstutz almost 6 years ago

See https://guides.rubyonrails.org/v4.2/active_record_migrations.html

The magic command is (I have to look this one up every time):

$ bin/rails generate migration TheNameOfTheMigration

A regular 64 bit integer should be fine for file_size_total, no need to use specialized numeric types like decimal or bigint.

You want an active record hook in the Collection model, see https://guides.rubyonrails.org/v4.2/active_record_callbacks.html

You only need to recompute file sizes/counts when "self.manifest_text.changed?"

Use Arv::Collection to parse the manifest text. Then you should be able to iterate over the files.

The migration SHOULD NOT use the Collection class, it should duplicate the logic for computing file counts/sizes (and/or share code in lib/). This is to prevent migrations (which by definition are run on databases created by older versions of the software) from breaking if the behavior of Collection class changes.

Actions #12

Updated by Eric Biagiotti over 5 years ago

  • Target version changed from 2019-03-27 Sprint to 2019-04-10 Sprint
Actions #13

Updated by Eric Biagiotti over 5 years ago

Latest at c5c82ef67b9dc3cb3619e2bef3a86b9b0f0912e8

  • Added database migration for adding file_count and file_size_total to the collections table.
  • Added logic and test for grouping pdhs by manifest size to the Container model.
  • Added file_count and file_size_total to the Collection model.

Unit Tests: https://ci.curoverse.com/view/Developer/job/developer-run-tests/1158/
Integration/Conformance: https://ci.curoverse.com/view/CWL/job/arvados-cwl-conformance-tests/74/console

To verify the migration is working correctly, I just added the following to the end of the up function and ran rake db:migrate:redo STEP=1 RAILS_ENV=test

    collections = ActiveRecord::Base.connection.exec_query(
      'SELECT DISTINCT portable_data_hash, manifest_text, file_count, file_size_total FROM collections ORDER BY portable_data_hash'
    )
    collections.rows.each do |c|
      print c, "\n" 
    end
    print(collections.rows.count)
Actions #15

Updated by Lucas Di Pentima over 5 years ago

Some comments & questions:

  • Why group_pdhs_by_* funcs are part of the Container model? They seem to be more related to Collections as they work with PDHs & manifests, but even then, I think they don’t use any of the model’s facilities, why not putting them in some separate lib? (didn’t follow the discussion on chat)
  • There’s a typo in a migration comment: “…all the distince pdhs greater…"
  • Can you explain the reasoning behind 6709876170511ade8e24fe60bf77da24bc4a03d4 ? I believe that if a manifest isn’t valid it should not reach set_file_count_and_total_size() because before_validation :check_manifest_validity, right?
  • Could you add collection's model & controller test for this new feature?
Actions #16

Updated by Lucas Di Pentima over 5 years ago

Also, please make sure that the new fields are read-only.

Actions #17

Updated by Eric Biagiotti over 5 years ago

Lucas Di Pentima wrote:

  • There’s a typo in a migration comment: “…all the distince pdhs greater…"
  • Why group_pdhs_by_* funcs are part of the Container model? They seem to be more related to Collections as they work with PDHs & manifests, but even then, I think they don’t use any of the model’s facilities, why not putting them in some separate lib? (didn’t follow the discussion on chat)

I agree on this. The conclusion of the discussion was that if it is container logic, make a static function on the Container model instead of a lib, even though we try not to use models in migrations. I don't think this is really container logic though.

I moved it and fixed the typo in 59a1fc872723c0bafa9764b95756723f54419631.

Still working on your other comments.

Actions #18

Updated by Eric Biagiotti over 5 years ago

  • Can you explain the reasoning behind 6709876170511ade8e24fe60bf77da24bc4a03d4 ? I believe that if a manifest isn’t valid it should not reach set_file_count_and_total_size() because before_validation :check_manifest_validity, right?

This was a bad attempt at me trying to get tests working. Some tests skip validation, so set_file_count_and_total_size was getting called with invalid manifests. I think the more appropriate fix is to call set_file_count_and_total_size as a part of the after_validation phase. This will ensure that we have a valid manifest when trying to calculate file count and total size.

Fixed in fc636d5e169d944981ce2951e05d59fad04563a3

  • Could you add collection's model & controller test for this new feature?

Done

Also, please make sure that the new fields are read-only.

Done. Any attempt to change the file attributes are overridden with the calculated values from the manifest.

Latest at 79316b0ebbf5bfe934267579478b89f770c0e5ba
Unit tests: https://ci.curoverse.com/view/Developer/job/developer-run-tests/1168/

Actions #19

Updated by Lucas Di Pentima over 5 years ago

Some comments:

  • On test/unit/collection_test.rb, what I think you’re trying to do are collection updates, right? lines 89 & 96 are creating new collections with the passed manifest text, but the comments say “Changing…” If you really meant to make those creations, it would be nice to also check that an update also updates the stats.
  • When referencing objects on the fixture I think it’s nice not to hardcode their UUIDs on the test, but get them via the helper functions by their fixture name, for example: collections(:fixture_name).uuid
  • On models/collection.rb, line 202: I think it's re-computing file sizes & count if a client attempts to change those values. Instead, I think it’s cheaper to reassign them to their previous values if they changed but the manifest text didn’t. You can access the old value by using for example: self.file_count_was
Actions #20

Updated by Eric Biagiotti over 5 years ago

Lucas Di Pentima wrote:

Some comments:

  • On test/unit/collection_test.rb, what I think you’re trying to do are collection updates, right? lines 89 & 96 are creating new collections with the passed manifest text, but the comments say “Changing…” If you really meant to make those creations, it would be nice to also check that an update also updates the stats.
  • When referencing objects on the fixture I think it’s nice not to hardcode their UUIDs on the test, but get them via the helper functions by their fixture name, for example: collections(:fixture_name).uuid
  • On models/collection.rb, line 202: I think it's re-computing file sizes & count if a client attempts to change those values. Instead, I think it’s cheaper to reassign them to their previous values if they changed but the manifest text didn’t. You can access the old value by using for example: self.file_count_was

All of the above addressed in 425836b285a32c31ef643f8c5d4b48b8b42b7ac4. I also added a collection controller test for updating a collection with manifest and file stats at the same time in a5cd06261d3ef5005c3bd921c610abfa21dc672f

Unit tests: https://ci.curoverse.com/view/Developer/job/developer-run-tests/1173/

Actions #21

Updated by Lucas Di Pentima over 5 years ago

This LGTM, please merge. Thanks!

Actions #22

Updated by Eric Biagiotti over 5 years ago

  • Status changed from In Progress to Resolved
Actions #23

Updated by Tom Morris over 5 years ago

  • Related to Idea #15093: Work with Ops to decide best DB migration strategy for collection file count & size added
Actions

Also available in: Atom PDF