Project

General

Profile

Actions

Feature #21123

closed

Add API that returns current dispatch/scheduling status for a specified container

Added by Peter Amstutz about 1 year ago. Updated 8 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Story points:
2.0
Release:
Release relationship:
Auto

Description

GET /arvados/v1/container_requests/{uuid}/container_status

Controller endpoint that checks for permission to read container request, looks up container, and proxies container status request to dispatcher (using the management token).

Container status includes a field describing the last decision made about that container -- booting an instance, in AtQuota state, in AtCapacity state, at maxConcurrency limit, etc.


Subtasks 1 (0 open1 closed)

Task #21560: Review 21123-scheduling-statusResolvedTom Clegg03/15/2024Actions

Related issues 3 (2 open1 closed)

Related to Arvados - Idea #21058: Let users know if Crunch hit cloud capacityNewActions
Related to Arvados - Idea #21542: Improved visibility on cloud instance (and maybe other resources?) quotasNewActions
Blocks Arvados - Feature #21297: use container_status API to display scheduling feedback on cloudResolvedPeter AmstutzActions
Actions #1

Updated by Peter Amstutz about 1 year ago

  • Related to Idea #21058: Let users know if Crunch hit cloud capacity added
Actions #2

Updated by Peter Amstutz about 1 year ago

  • Story points set to 2.0
  • Description updated (diff)
Actions #3

Updated by Tom Clegg about 1 year ago

  • Description updated (diff)
  • Subject changed from Cloud capacity backend to Add API that returns current dispatch/scheduling status for a specified container
Actions #4

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Future to Development 2024-02-14 sprint
Actions #5

Updated by Peter Amstutz about 1 year ago

  • Blocks Feature #21297: use container_status API to display scheduling feedback on cloud added
Actions #6

Updated by Peter Amstutz 11 months ago

  • Target version changed from Development 2024-02-14 sprint to Development 2024-02-28 sprint
Actions #7

Updated by Peter Amstutz 11 months ago

  • Target version changed from Development 2024-02-28 sprint to Development 2024-03-13 sprint
Actions #8

Updated by Peter Amstutz 10 months ago

  • Related to Idea #21542: Improved visibility on cloud instance (and maybe other resources?) quotas added
Actions #9

Updated by Peter Amstutz 10 months ago

  • Assigned To set to Tom Clegg
Actions #10

Updated by Tom Clegg 10 months ago

  • Status changed from New to In Progress

21123-scheduling-status @ 35d4eea994c98b4bb65111c9da6e57abbc7e014f -- developer-run-tests: #4068

  • All agreed upon points are implemented / addressed.
    • New controller API proxies to new dispatcher API and returns the scheduling status for a specified container.
    • As mentioned below, when the cluster is at capacity, there isn't a ton of detail about that (e.g., quota vs. shortage of specific instance type) but we do get to see the number of containers waiting ahead of this one, which (I think) will often be the more interesting information.
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • TBD: where should we put our goalposts for revealing "why not more instances?", or, should we just leave it at this for now
  • Code is tested and passing, both automated and manual, what manual testing was done is described
    • There are automated tests. Rather than insert delays to guarantee each status can be observed, I've just tested that the observed sequence is a subset of the expected statuses, and that the steady state of a simple test scenario (2 containers can run, 2 containers wait for an unavailable instance type) is dependable. The dispatcher-side API test also dumps the observed sequence of statuses for the test containers, so it can be manually inspected.
  • Documentation has been updated.
    • Added dispatcher management API docs and arvados/v1/container_requests docs.
  • Behaves appropriately at the intended scale (describe intended scale).
    • Adds some memory and cpu overhead to each scheduler iteration, but I don't think it will be noticeable. The work of building a map of container UUID → scheduling info is a function of queue size, so that's limited to 1 per second regardless of number of incoming API calls.
  • Considered backwards and forwards compatibility issues between client and server.
    • If a new controller connects to an old dispatcher, the new API endpoint returns 404 and the controller returns "waiting for dispatch" to its client, which seems like the best answer in the circumstances.
  • Follows our coding standards and GUI style guidelines.

The dumped sequence shows a bit more churn than I expected. I think this is caused by the fact that the scheduler doesn't see the difference between "asked cloud to create an instance" and "waiting for instance to boot". After a burst of "create instance" calls, there's a brief interval where all of the relevant containers are in "waiting for instance to be ready" status, but then as the create calls fail and we realize that we're at quota, they get pushed back to "waiting for capacity, queue position 1", then perhaps "queue position 2". I think this is OK: it happens quickly, and hopefully rarely, and in the general case it's unavoidable anyway because other higher-priority containers can always jump the queue while a container is waiting for an instance to be ready.

There is definitely more to be done here wrt explaining why the indicated container (or the ones ahead of it in the queue) cannot be scheduled right now. I'm putting this up for review anyway because
  • even seeing the changing position in the queue, and the transitions from "cluster already full" to "waiting for an instance to be ready" to "preparing runtime environment" will be a huge improvement even if the nature of the cluster bottleneck is still opaque
  • merging this will unblock the Workbench side of the feature
Actions #11

Updated by Brett Smith 10 months ago

Tom Clegg wrote in #note-10:

21123-scheduling-status @ 35d4eea994c98b4bb65111c9da6e57abbc7e014f -- developer-run-tests: #4068

This implementation looks good to me. My comments are all about documentation or support stuff.

doc/api/dispatch.html.textile.liquid

+* a @scheduling_status@ entry: a brief explanation of the container's status in the dispatch queue, or empty if scheduling is not applicable, e.g., the container has already started running.

In order to better match the prevailing style of the existing list elements, this would be better written as:

* a @scheduling_status@ entry with a brief explanation of the container's status in the dispatch queue, or empty if scheduling is not applicable (e.g., the container has already started running).

I also wonder what "empty" means. Empty string? null? Does not appear? (That's what it looks like from the code.) Clarification would be helpful.

doc/api/methods/container_requests.html.textile.liquid

Similar comment about "empty" in the scheduling_status table row.

Tests

Have you done hundreds of runs of the new/modified tests to check for race conditions in the new code?

lib/dispatchcloud/dispatcher.go

I will throw this out only once: instead of sQueue throughout, consider schedQueue, or if that's too long, schedQue or schedQ. Just looking at dispatcher's other fields, s could reference sched, ssh, setup or stop. You really need to know the type of sQueue for the rest to make sense, and I don't think that will always be at hand for different development tasks.

lib/dispatchcloud/scheduler/run_queue.go

I think it would be a nice style improvement to have all the possible SchedulingStatus strings together. As a reviewer, it would've been nice to check the text for consistent style. I think it will also be an aid to future developers as we add or edit reasons.

I'm imagining them all in a const block, but the details matter less than the principle of them being all together and not sprinkled within a 350-ish-line function definition.

services/api/config/routes.rb

Should the new method be added to the discovery document (with documentation)?

Actions #12

Updated by Tom Clegg 10 months ago

Brett Smith wrote in #note-11:

In order to better match the prevailing style of the existing list elements, this would be better written as:

[...]

Fixed.

I also wonder what "empty" means. Empty string? null? Does not appear? (That's what it looks like from the code.) Clarification would be helpful.

Similar comment about "empty" in the scheduling_status table row.

Fixed with "the empty string".

Then thought about it some more and replaced those (and a few other instances) with "an empty string".

Have you done hundreds of runs of the new/modified tests to check for race conditions in the new code?

Yes.

I will throw this out only once: instead of sQueue throughout, consider schedQueue, or if that's too long, schedQue or schedQ. Just looking at dispatcher's other fields, s could reference sched, ssh, setup or stop. You really need to know the type of sQueue for the rest to make sense, and I don't think that will always be at hand for different development tasks.

Good point. Updated to schedQueue*.

I think it would be a nice style improvement to have all the possible SchedulingStatus strings together. As a reviewer, it would've been nice to check the text for consistent style. I think it will also be an aid to future developers as we add or edit reasons.

Good point. Added consts.

Should the new method be added to the discovery document (with documentation)?

Good point. Added to discovery doc by adding a stub method in rails. The documentation string is "container_status container_request" which is weird but that's what we put in such doc strings and there isn't yet a mechanism for specifying a useful string. Should I add that now? Or change the template to "see doc.arvados.org/appropriate-methods-page"?

21123-scheduling-status @ 536888e31706dfb9db43338e858529a7432a13c1 -- developer-run-tests: #4076

Actions #13

Updated by Tom Clegg 10 months ago

  • Target version changed from Development 2024-03-13 sprint to Development 2024-03-27 sprint
Actions #14

Updated by Tom Clegg 10 months ago

I went yak shaving (sdk/python tests don't run on python3.9 any more → upgrade dev vm to bookworm → python is confused → delete VENV3DIR in temp dir → lib/controller tests fail → use journalctl to find crunch-run logs revealing the problem is arv-mount is not installed) and came back with this fix, to save myself/others the trouble next time:

diff --git a/build/run-tests.sh b/build/run-tests.sh
index 28e9e1cf7b..9ed3d32ae2 100755
--- a/build/run-tests.sh
+++ b/build/run-tests.sh
@@ -1058,6 +1058,7 @@ install_deps() {
     do_install sdk/ruby-google-api-client
     do_install sdk/ruby
     do_install services/api
+    do_install services/fuse pip "${VENV3DIR}/bin/" 
     do_install services/keepproxy go
     do_install services/keep-web go
 }

21123-scheduling-status @ 95dba0428816e5b53cde4f8b15ae283619c51d54 -- developer-run-tests: #4077

Actions #15

Updated by Brett Smith 10 months ago

Tom Clegg wrote in #note-14:

I went yak shaving (sdk/python tests don't run on python3.9 any more → upgrade dev vm to bookworm → python is confused → delete VENV3DIR in temp dir → lib/controller tests fail → use journalctl to find crunch-run logs revealing the problem is arv-mount is not installed)

The fix seems fine but let's go back to that first part. I expect we'll be supporting Python 3.8 through at least the rest of the year and therefore personally I expect the tests to pass on Python 3.8 as well. And for me they do, except for #20909:

What next? test sdk/python:py3
[…]
Ran 936 tests in 306.861s

FAILED (failures=1, skipped=2)
Test failed: <unittest.runner.TextTestResult run=936 errors=0 failures=1>
error: Test failed: <unittest.runner.TextTestResult run=936 errors=0 failures=1>
======= sdk/python tests -- FAILED
======= test sdk/python -- 308s
Failures (1):
Fail: sdk/python tests (308s)
What next? [^D]
[…]
Leaving behind temp dirs in /home/brett/.cache/arvados-test
% ~/.cache/arvados-test/VENV3DIR/bin/python3 -V
Python 3.8.18

Can you please file a bug with what you were seeing?

Actions #16

Updated by Brett Smith 10 months ago

The branch LGTM. Everything below is small nits, take what you want and then merge either way:

Tom Clegg wrote in #note-12:

Brett Smith wrote in #note-11:

In order to better match the prevailing style of the existing list elements, this would be better written as:

[...]

Fixed.

This still has the "e.g." example without parentheses. This is not a big deal but literally the previous bullet has it in parentheses. Bouncing styles like this is unfortunate.

Have you done hundreds of runs of the new/modified tests to check for race conditions in the new code?

Yes.

Awesome, thank you.

Should the new method be added to the discovery document (with documentation)?

Good point. Added to discovery doc by adding a stub method in rails. The documentation string is "container_status container_request" which is weird but that's what we put in such doc strings and there isn't yet a mechanism for specifying a useful string. Should I add that now? Or change the template to "see doc.arvados.org/appropriate-methods-page"?

I think my vote is we plan to to add a mechanism for more useful strings as part of #19929. If you have ideas for how to do that, can you please add those in a comment on the ticket?

Re the FUSE fix: The comment in your commit message is great. I think it would be even more valuable as a comment in the code directly. Like I said, please let us know what issue you saw running tests on Py3.9.

Thanks.

Actions #17

Updated by Tom Clegg 10 months ago

  • Status changed from In Progress to Resolved
Actions #18

Updated by Peter Amstutz 8 months ago

  • Release set to 70
Actions

Also available in: Atom PDF