Feature #17574

keep-balance updates API server with correct storage_classes_confirmed

Added by Peter Amstutz 5 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Keep
Target version:
Start date:
07/29/2021
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-

Subtasks

Task #17934: Review 17574-storage-classes-confirmedResolvedTom Clegg


Related issues

Related to Arvados Epics - Story #16107: Storage classesNew03/01/202109/30/2021

Associated revisions

Revision d7e76aa4
Added by Tom Clegg about 2 months ago

Merge branch '17574-storage-classes-confirmed' into main

closes #17574

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Peter Amstutz 5 months ago

#2 Updated by Peter Amstutz 2 months ago

  • Target version set to 2021-08-04 sprint
  • Assigned To set to Tom Clegg

#3 Updated by Tom Clegg 2 months ago

  • Status changed from New to In Progress

#4 Updated by Tom Clegg 2 months ago

17574-storage-classes-confirmed @ e75d90299dc2ff0dbccf28c134ee2640f79b2b5d -- https://ci.arvados.org/view/Developer/job/developer-run-tests/2615/

After a balancing iteration, keep-balance updates replication_confirmed, replication_confirmed_at, storage_classes_confirmed, storage_classes_confirmed_at fields in the collections table.

  • replication_confirmed is clamped at replication_desired (or, if that's null, DefaultReplication) -- avoids excessive updates when block replication increases temporarily due to rendezvous shuffling or higher-replication collections using the same blocks
  • the ..._confirmed_at timestamps are updated only when the ..._confirmed fields change -- avoids excessive updates when nothing meaningful is changing
  • all updates are done directly to the database rather than going through controller/railsAPI -- much faster, and avoids messing with the modified_at/updated_at timestamps and writing a huge number of rows in the logs table (note this means no websocket notification event for these updates either)
  • now that keep-balance has database access, it seemed like a good time to update the "fetch all collections" code to use one long postgresql select query instead of getting pages from controller/railsAPI and going through some agony to avoid missing any
    • with that change, the bottleneck changed from "get collections" to "parse manifests"
    • profiling revealed the manifest parsing code spent 70% of its time allocating/copying/GCing strings; replacing the string operations with byte-slice operations made it waay faster
    • testing on su92l (766K rows) "get all collections" took 2000s before, 90s after this change; now the bottleneck is getting block lists from keepstores (250s)
  • applying the updates is a bit slow: about 40 updates/s on a 2-core machine, 120 updates/s on an 8-core machine. They are committed in batches of 100 updates per transaction; doing them individually with autocommit was about half as fast
  • there's a configurable limit (Collections.BalanceUpdateLimit) for # updates per iteration, default 100K. A site with >100K collections will need multiple passes to apply all the updates after upgrading to this version -- and thanks to the "get all collections" speedup, even when it's doing 100K updates this version will still complete a cycle faster than the old version

#5 Updated by Peter Amstutz about 2 months ago

By 40 updates/s do you mean 40 collections per second or 40 batches of 100 (4000 collections/s) ?

#6 Updated by Lucas Di Pentima about 2 months ago

  • At file balance.go:L436 we use all the available cores to parse collection manifests and I'm assuming keep-balance is usually run in the same node as other services (ie: keepproxy, as recommended by the docs), should we leave some cores for other off-process purposes? or maybe add a config knob so that the admin can tune it?
  • I've manually tested it on an arvbox instance and tried adding an inexistent storage class to a collection, after running keep-balance this class was added to the _confirmed field seemingly without any warning or error message on the logs. Even removing the legal storage class and leaving only the fake one didn't produce any errors, is this an expected behavior?
  • Each run keep-balance produces a replication level distribution chart, do you think it would be useful to do the same for storage classes?

#7 Updated by Tom Clegg about 2 months ago

Peter Amstutz wrote:

By 40 updates/s do you mean 40 collections per second or 40 batches of 100 (4000 collections/s) ?

40 collections per second.

#8 Updated by Tom Clegg about 2 months ago

Lucas Di Pentima wrote:

  • At file balance.go:L436 we use all the available cores to parse collection manifests and I'm assuming keep-balance is usually run in the same node as other services (ie: keepproxy, as recommended by the docs), should we leave some cores for other off-process purposes? or maybe add a config knob so that the admin can tune it?

Leaving CPUs for other processes seems like a good idea. But I kinda think "nice" in the systemd unit file would be a better way...

  • I've manually tested it on an arvbox instance and tried adding an inexistent storage class to a collection, after running keep-balance this class was added to the _confirmed field seemingly without any warning or error message on the logs. Even removing the legal storage class and leaving only the fake one didn't produce any errors, is this an expected behavior?

Hm, that sounds like a bug, or at least misleading. I'm guessing it left you with replication_confirmed==0, storage_classes_confirmed==["bogus"]? In that case we should probably change that to replication_confirmed==0, storage_classes_confirmed==[] or null.

  • Each run keep-balance produces a replication level distribution chart, do you think it would be useful to do the same for storage classes?

Probably, but it's not obvious (at least to me) what would be the most useful report... maybe we need a more general version of #15758 (trash reporting) where we can get ideas together. IOW not inclined to do it as part of this issue.

#9 Updated by Lucas Di Pentima about 2 months ago

Tom Clegg wrote:

  • I've manually tested it on an arvbox instance and tried adding an inexistent storage class to a collection, after running keep-balance this class was added to the _confirmed field seemingly without any warning or error message on the logs. Even removing the legal storage class and leaving only the fake one didn't produce any errors, is this an expected behavior?

Hm, that sounds like a bug, or at least misleading. I'm guessing it left you with replication_confirmed==0, storage_classes_confirmed==["bogus"]? In that case we should probably change that to replication_confirmed==0, storage_classes_confirmed==[] or null.

Yes, that was what happened, replication_confirmed was left at 0, makes sense to leave storage_classes_confirmed as empy list or null.

#10 Updated by Tom Clegg about 2 months ago

17574-storage-classes-confirmed @ de030f2c3f7a0c57904def9662d4d7a979b90497 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/2616/
  • sets storage_classes_confirmed to [] when setting replication_confirmed to 0.

#11 Updated by Lucas Di Pentima about 2 months ago

LGTM, thanks!

#12 Updated by Peter Amstutz about 2 months ago

  • Target version changed from 2021-08-04 sprint to 2021-08-18 sprint

#13 Updated by Tom Clegg about 2 months ago

  • Status changed from In Progress to Resolved

#14 Updated by Peter Amstutz about 2 months ago

  • Target version changed from 2021-08-18 sprint to 2021-08-04 sprint

Also available in: Atom PDF