Bug #20533

closed

Better handling of request surges when canceling a large workflow

Added by Peter Amstutz over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: API
Target version: -
Story points: -
Release relationship: Auto

Description

Specific test case: running a workflow with hundreds of containers and then canceling them all at once causes a massive surge of requests to the API server, because every container finalizes at the same time.

Want to test ways that we can mitigate this traffic surge so that:

  1. all the containers finalize without fatal 503 errors (#20540, #20541)
  2. the workbench remains responsive (at least for GET requests during this time)
    1. evaluate configuration changes
    2. load balancing #20539
    3. controller request prioritization #20602
    4. Send out cancellations at a slower rate than whatever it's doing right now
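
As an illustration of the last idea (sending cancellations at a slower rate), here is a minimal sketch of a paced cancellation loop. It is not what Arvados currently does: `cancel_paced`, `cancel_one`, and `per_second` are hypothetical names, and `cancel_one` stands in for whatever call actually cancels a single container.

```python
import time
from typing import Callable, Iterable, List


def cancel_paced(
    uuids: Iterable[str],
    cancel_one: Callable[[str], None],  # hypothetical: cancels one container
    per_second: float = 10.0,           # assumed rate limit, tune to taste
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Call cancel_one(uuid) for each uuid, at most per_second times a second."""
    if per_second <= 0:
        raise ValueError("per_second must be positive")
    interval = 1.0 / per_second
    done: List[str] = []
    next_slot = clock()  # earliest time the next cancellation may be sent
    for uuid in uuids:
        delay = next_slot - clock()
        if delay > 0:
            sleep(delay)
        cancel_one(uuid)
        done.append(uuid)
        # Schedule the next slot one interval after this one (or after now,
        # if cancel_one itself took longer than the interval).
        next_slot = max(next_slot, clock()) + interval
    return done
```

Spreading the cancellations out means the containers also start finalizing at staggered times, so their requests reach the API server as a steady stream rather than a single burst.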

Related issues

Related to Arvados Epics - Idea #20599: Scaling to 1000s of concurrent containers (Resolved, 06/01/2023 to 03/31/2024)
Related to Arvados - Idea #20602: Prioritize requests made by workbench 2 (Resolved, Tom Clegg, 06/08/2023)
Actions #1

Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)
Actions #4

Updated by Peter Amstutz over 1 year ago

  • Target version changed from Future to To be scheduled
Actions #5

Updated by Peter Amstutz over 1 year ago

  • Target version changed from To be scheduled to Development 2023-06-07
Actions #6

Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)
  • Subject changed from Better handling of request surges to Better handling of request surges when canceling a large workflow
Actions #7

Updated by Peter Amstutz over 1 year ago

  • Assigned To set to Peter Amstutz
Actions #8

Updated by Peter Amstutz over 1 year ago

  • Related to Idea #20599: Scaling to 1000s of concurrent containers added
Actions #10

Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)
Actions #11

Updated by Peter Amstutz over 1 year ago

  • Related to Idea #20602: Prioritize requests made by workbench 2 added
Actions #12

Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)
Actions #13

Updated by Peter Amstutz over 1 year ago

  • Target version changed from Development 2023-06-07 to Future
Actions #14

Updated by Peter Amstutz over 1 year ago

  • Status changed from New to In Progress
Actions #15

Updated by Peter Amstutz over 1 year ago

  • Status changed from In Progress to Feedback
Actions #16

Updated by Peter Amstutz over 1 year ago

I ran a test with 250 containers and then hit cancel. The request queue does immediately fill up with requests as the containers try to finalize, but:

  1. Workbench remains responsive (very important)
  2. I believe all the crunch-run processes retry and terminate gracefully (but a bug I thought I fixed might still be a thing: https://dev.arvados.org/issues/20614#note-14)
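
For reference, the retry behavior described above can be sketched roughly as follows. This is not crunch-run's actual code, only a generic retry loop with exponential backoff and jitter. `ServiceUnavailable` is a hypothetical stand-in for a 503 response, and `call_with_retry` and its parameters are assumed names.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ServiceUnavailable(Exception):
    """Hypothetical stand-in for an HTTP 503 from the API server."""


def call_with_retry(
    request: Callable[[], T],          # hypothetical: performs one API request
    max_attempts: int = 8,             # assumed limits, not taken from Arvados
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Retry request() on ServiceUnavailable with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return request()
        except ServiceUnavailable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the failure
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Random delay in [0, cap) so that hundreds of clients do not
            # all retry at the same instant.
            sleep(rng() * cap)
    raise ValueError("max_attempts must be at least 1")
```

The jitter matters in this scenario: if 250 containers all back off by the same fixed amount, the surge simply repeats on every retry.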
Actions #17

Updated by Peter Amstutz over 1 year ago

  • Release set to 66
  • Target version deleted (Future)
  • Status changed from Feedback to Resolved