Project

General

Profile

Actions

Feature #17574

closed

keep-balance updates API server with correct storage_classes_confirmed

Added by Peter Amstutz about 3 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Keep
Target version:
Story points:
-
Release relationship:
Auto

Subtasks 1 (0 open1 closed)

Task #17934: Review 17574-storage-classes-confirmedResolvedTom Clegg07/29/2021Actions

Related issues

Related to Arvados Epics - Idea #16107: Storage classesResolved03/01/202109/30/2021Actions
Actions #1

Updated by Peter Amstutz about 3 years ago

Actions #2

Updated by Peter Amstutz almost 3 years ago

  • Target version set to 2021-08-04 sprint
  • Assigned To set to Tom Clegg
Actions #3

Updated by Tom Clegg almost 3 years ago

  • Status changed from New to In Progress
Actions #4

Updated by Tom Clegg over 2 years ago

17574-storage-classes-confirmed @ e75d90299dc2ff0dbccf28c134ee2640f79b2b5d -- developer-run-tests: #2615

After a balancing iteration, keep-balance updates replication_confirmed, replication_confirmed_at, storage_classes_confirmed, storage_classes_confirmed_at fields in the collections table.

  • replication_confirmed is clamped at replication_desired (or, if that's null, DefaultReplication) -- avoids excessive updates when block replication increases temporarily due to rendezvous shuffling or higher-replication collections using the same blocks
  • the ..._confirmed_at timestamps are updated only when the ..._confirmed fields change -- avoids excessive updates when nothing meaningful is changing
  • all updates are done directly to the database rather than going through controller/railsAPI -- much faster, and avoids messing with the modified_at/updated_at timestamps and writing a huge number of rows in the logs table (note this means no websocket notification event for these updates either)
  • now that keep-balance has database access, it seemed like a good time to update the "fetch all collections" code to use one long postgresql select query instead of getting pages from controller/railsAPI and going through some agony to avoid missing any
    • with that change, the bottleneck changed from "get collections" to "parse manifests"
    • profiling revealed the manifest parsing code spent 70% of its time allocating/copying/GCing strings; replacing the string operations with byte-slice operations made it waay faster
    • testing on su92l (766K rows) "get all collections" took 2000s before, 90s after this change; now the bottleneck is getting block lists from keepstores (250s)
  • applying the updates is a bit slow: about 40 updates/s on a 2-core machine, 120 updates/s on an 8-core machine. They are committed in batches of 100 updates per transaction; doing them individually with autocommit was about half as fast
  • there's a configurable limit (Collections.BalanceUpdateLimit) for # updates per iteration, default 100K. A site with >100K collections will need multiple passes to apply all the updates after upgrading to this version -- and thanks to the "get all collections" speedup, even when it's doing 100K updates this version will still complete a cycle faster than the old version
Actions #5

Updated by Peter Amstutz over 2 years ago

By 40 updates/s do you mean 40 collections per second or 40 batches of 100 (4000 collections/s) ?

Actions #6

Updated by Lucas Di Pentima over 2 years ago

  • At file balance.go:L436 we use all the available cores to parse collection manifests and I'm assuming keep-balance is usually run in the same node as other services (ie: keepproxy, as recommended by the docs), should we leave some cores for other off-process purposes? or maybe add a config knob so that the admin can tune it?
  • I've manually tested it on an arvbox instance and tried adding an inexistent storage class to a collection, after running keep-balance this class was added to the _confirmed field seemingly without any warning or error message on the logs. Even removing the legal storage class and leaving only the fake one didn't produce any errors, is this an expected behavior?
  • Each run keep-balance produces a replication level distribution chart, do you think it would be useful to do the same for storage classes?
Actions #7

Updated by Tom Clegg over 2 years ago

Peter Amstutz wrote:

By 40 updates/s do you mean 40 collections per second or 40 batches of 100 (4000 collections/s) ?

40 collections per second.

Actions #8

Updated by Tom Clegg over 2 years ago

Lucas Di Pentima wrote:

  • At file balance.go:L436 we use all the available cores to parse collection manifests and I'm assuming keep-balance is usually run in the same node as other services (ie: keepproxy, as recommended by the docs), should we leave some cores for other off-process purposes? or maybe add a config knob so that the admin can tune it?

Leaving CPUs for other processes seems like a good idea. But I kinda think "nice" in the systemd unit file would be a better way...

  • I've manually tested it on an arvbox instance and tried adding an inexistent storage class to a collection, after running keep-balance this class was added to the _confirmed field seemingly without any warning or error message on the logs. Even removing the legal storage class and leaving only the fake one didn't produce any errors, is this an expected behavior?

Hm, that sounds like a bug, or at least misleading. I'm guessing it left you with replication_confirmed==0, storage_classes_confirmed==["bogus"]? In that case we should probably change that to replication_confirmed==0, storage_classes_confirmed==[] or null.

  • Each run keep-balance produces a replication level distribution chart, do you think it would be useful to do the same for storage classes?

Probably, but it's not obvious (at least to me) what would be the most useful report... maybe we need a more general version of #15758 (trash reporting) where we can get ideas together. IOW not inclined to do it as part of this issue.

Actions #9

Updated by Lucas Di Pentima over 2 years ago

Tom Clegg wrote:

  • I've manually tested it on an arvbox instance and tried adding an inexistent storage class to a collection, after running keep-balance this class was added to the _confirmed field seemingly without any warning or error message on the logs. Even removing the legal storage class and leaving only the fake one didn't produce any errors, is this an expected behavior?

Hm, that sounds like a bug, or at least misleading. I'm guessing it left you with replication_confirmed==0, storage_classes_confirmed==["bogus"]? In that case we should probably change that to replication_confirmed==0, storage_classes_confirmed==[] or null.

Yes, that was what happened, replication_confirmed was left at 0, makes sense to leave storage_classes_confirmed as empy list or null.

Actions #10

Updated by Tom Clegg over 2 years ago

17574-storage-classes-confirmed @ de030f2c3f7a0c57904def9662d4d7a979b90497 -- developer-run-tests: #2616
  • sets storage_classes_confirmed to [] when setting replication_confirmed to 0.
Actions #11

Updated by Lucas Di Pentima over 2 years ago

LGTM, thanks!

Actions #12

Updated by Peter Amstutz over 2 years ago

  • Target version changed from 2021-08-04 sprint to 2021-08-18 sprint
Actions #13

Updated by Tom Clegg over 2 years ago

  • Status changed from In Progress to Resolved
Actions #14

Updated by Peter Amstutz over 2 years ago

  • Target version changed from 2021-08-18 sprint to 2021-08-04 sprint
Actions #15

Updated by Peter Amstutz over 2 years ago

  • Release set to 42
Actions

Also available in: Atom PDF