Project

General

Profile

Actions

Bug #21189

closed

Timeout sending extremely large trash list

Added by Peter Amstutz 6 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Keep
Story points:
-
Release relationship:
Auto

Description

Nov 13 09:23:16 ip-172-25-144-215 keep-balance60515: {"ClusterID":"xngs1","PID":60515,"level":"info","msg":"31737641 replicas (31737641 blocks, 2032657094645695 bytes) garbage (have\u003ewant=0, old)","time":"2023-11-13T09:23:16.806950720Z"}

Nov 13 11:02:24 ip-172-25-144-215 keep-balance60515: {"ClusterID":"xxxxx","PID":60515,"level":"info","msg":"xxxxx-bi6l4-328af1f189592368 (172.25.144.172:25107, disk): send trash list: PUT http://172.25.144.172:25107/trash giving up after 1 attempt(s): Put \"http://172.25.144.172:25107/trash\": context deadline exceeded","time":"2023-11-13T11:02:24.755687707Z"}

Nov 13 11:02:24 ip-172-25-144-215 keep-balance60515: {"ClusterID":"xxxxx","PID":60515,"level":"info","msg":"send_trash_lists: took 5932.531866437s","time":"2023-11-13T11:02:24.755744954Z"}


Subtasks 1 (0 open1 closed)

Task #21193: Review 21189-changeset-limitResolvedTom Clegg11/15/2023Actions
Actions #1

Updated by Peter Amstutz 6 months ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz 6 months ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz 6 months ago

  • Release set to 67
Actions #4

Updated by Tom Clegg 6 months ago

  • Status changed from New to In Progress

You may be able to resolve this by increasing the BalanceTimeout config (default 6h).

We could also add a configurable limit for the trash/pull list size, but I'm not sure that would be more efficient in the long run.

Actions #5

Updated by Peter Amstutz 6 months ago

Tom Clegg wrote in #note-4:

You may be able to resolve this by increasing the BalanceTimeout config (default 6h).

We could also add a configurable limit for the trash/pull list size, but I'm not sure that would be more efficient in the long run.

So we changed BalanceTimeout from 6h to 12h and it finished the balance run, but the new problem is that when it sent the trash list it filled up the entire 8 GiB of the keepstore node and keepstore got OOM killed.

We could up-size the keepstore node although I don't know exactly how big it needs to go. I think it would be a good idea to add a limit to the pull list/trash list size.

Actions #6

Updated by Peter Amstutz 6 months ago

  • Assigned To set to Tom Clegg
Actions #7

Updated by Tom Clegg 6 months ago

21189-changeset-limit @ e98eb84cc8f860bfad29b2b79e72603fccad673c -- developer-run-tests: #3905 retry developer-run-tests-remainder: #4104

  • All agreed upon points are implemented / addressed.
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • n/a
  • Code is tested and passing, both automated and manual, what manual testing was done is described
    • ✅ automated tests updated
  • Documentation has been updated.
    • ✅ new config keys are documented
    • ☐ upgrade notes could say "if you already do more than 100K pulls/trashes per iteration successfully then you could increase this limit" ... or maybe that's not likely enough to mention?
  • Behaves appropriately at the intended scale (describe intended scale).
    • If a cluster needs (and can implement) more than 100K trash/pull ops per keep-balance cycle, this default config would create a bottleneck. This will show up as non-zero "Deferred" counters in the log messages.
    • ☐ Should we add "deferred pulls/trashes" metrics to support prometheus alerts?
  • Considered backwards and forwards compatibility issues between client and server.
    • n/a
  • Follows our coding standards and GUI style guidelines.

This also retires the "disable pulls/trashes" commmand line flags as discussed in #21126#note-13, #21126#note-15. If the command line has -commit-trash[=true] then the config file is obeyed. If the command line has -commit-trash=false keep-balance exits EX_USAGE and that tells systemd not to auto-restart.

Actions #8

Updated by Peter Amstutz 6 months ago

      # keep-balance. Larger values allow more progress per
      # keep-balance iteration. Setting both limits to zero amounts
      # to a "dry run".

Describing the zero case as "dry run" isn't descriptive enough, it should be something like "a value of zero means that nothing will be sent, with the effect that the pull or trash list will be computed but not acted on."

            logger.Print("WARNING: Will scan periodically, but no changes will be committed.")
            logger.Print("=======  Consider using non-zero BalancePullLimit and BalanceTrashLimit configs.")

"Consider using ..." is sort of a weird phrasing, it would be more direct to say "Set BalancePullLimit or BalanceTrashLimit to a value greater than zero to commit changes."

Rest LGTM.

Actions #9

Updated by Tom Clegg 5 months ago

  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF