Story #12552

When priority is equal, the children of an earlier-submitted workflow should run first

Added by Tom Morris almost 3 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
11/03/2017
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
2.0
Release relationship:
Auto

Description

Currently containers are prioritized by descending priority, then creation time.

Typically, each workflow runs many child containers. When multiple workflows are running, the creation times of their child containers are interleaved, and therefore so is their execution order. As more work is added to the system, earlier workflows make slower and slower progress.

Desired behavior:

If two concurrent workflows have equal priority, the children of the earlier-submitted workflow should be prioritized ahead of children of the later-submitted workflow.


Subtasks

Task #13086: Review 12552-slurm-priority (Resolved, Lucas Di Pentima)

Task #13122: Report workflow priority in containers#show API response (Resolved, Tom Clegg)

Task #13159: Review 12552-wf-priority (Resolved, Tom Clegg)


Related issues

Related to Arvados - Story #12574: [API] Propagate priority of parent container to children (Resolved, 11/30/2017)

Related to Arvados - Bug #12573: Priority is ignored in Crunch2 (Resolved, 11/30/2017)

Related to Arvados - Bug #13166: [node manager] wishlist should consist of top priority containers (Resolved, 03/26/2018)

Associated revisions

Revision 6346a7c4
Added by Tom Clegg over 2 years ago

Merge branch '12552-slurm-priority'

refs #12552

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision f05926dd (diff)
Added by Tom Clegg over 2 years ago

12552: Fix starvation in test case.

refs #12552

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision 4b563772
Added by Tom Clegg over 2 years ago

Merge branch '12552-wf-priority'

refs #12552

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Peter Amstutz over 2 years ago

  • Related to Story #12574: [API] Propagate priority of parent container to children added

#2 Updated by Peter Amstutz over 2 years ago

One way to do this is something like what slurm does: submit the first workflow at a high priority (say 900) and each subsequent workflow is given the next lower priority, 899, 898, 897, ... and the priority propagates to all child jobs (#12574)

#3 Updated by Tom Clegg over 2 years ago

Peter Amstutz wrote:

One way to do this is something like what slurm does: submit the first workflow at a high priority (say 900) and each subsequent workflow is given the next lower priority, 899, 898, 897, ... and the priority propagates to all child jobs (#12574)

I don't think we need to change the meaning of the "priority" field or ask users to track a moving target. If we sort first by top-level priority, then by top-level submit time, and propagate that order to the slurm queue (#12573) or other scheduling backend, it seems like users will get what they want: higher-priority workflows go first, and equal-priority workflows run ~serially instead of slowing one another down.

#4 Updated by Peter Amstutz over 2 years ago

  • Related to Bug #12573: Priority is ignored in Crunch2 added

#6 Updated by Tom Morris over 2 years ago

  • Tracker changed from Bug to Story

#7 Updated by Tom Clegg over 2 years ago

The container priority field would be much easier for dispatchers (and workbench and other clients) to use if it were offered as a single number, with the semantics "a container with priority=X has precedence over every container with priority<X."

The container request priority field should still be controlled directly by the caller, though, and the behavior requested here (i.e., if top-level containers WF1 and WF2 have equal priority, new children of WF1 run before old children of WF2) should still be possible.

Proposed implementation:
  • Container priority is an int64: low ~50 bits are a negative timestamp [1], high ~10 bits are "Pmax", the highest priority of any container request that's being satisfied by this container
  • The timestamp is not necessarily the creation time of the container itself -- it's the creation time of the oldest top-level container that has priority=Pmax.

In this example, the container at the bottom has priority (2<<50)-1911 because the highest priority of any requesting CR is 2, and the oldest top-level container with priority=2 was created at 1911.

  CTR REQ                   CTR REQ                  CTR REQ                   
  priority=1                priority=2               priority=2
  created=1900              created=1910             created=1920              
      |                         |                        |                     
  CTR                       CTR                      CTR                       
  priority=(1<<50)-1901     priority=(2<<50)-1911    priority=(2<<50)-1921     
  created=1901              created=1911             created=1921              
      |                         |                        |                     
  CTR REQ                   CTR REQ                  CTR REQ                   
  priority=1                priority=2               priority=2
  created=1902              created=1912             created=1922
      |                         |                        |
      +-------------------------+------------------------+
      |
  CTR
  priority=(2<<50)-1911
  created=1904

This bottom-level container would be delayed only by a competing container with an ancestor that has
  • priority>2, or
  • priority=2 and created<1911.

[1] A JavaScript timestamp (milliseconds since the Unix epoch) might be a reasonable choice: 44 bits are enough to span 1970-01-01 to 2527-06-23 with 1ms granularity.
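The packing scheme above can be sketched as follows. `encodePriority` is a hypothetical name, not from the Arvados codebase, and the example values come from the diagram (Pmax=2, oldest top-level priority-2 container created at t=1911):

```go
package main

import "fmt"

// encodePriority packs the effective container priority into one int64
// per the scheme above: Pmax (the highest priority of any container
// request this container satisfies) in the high bits, minus the creation
// timestamp of the oldest top-level container with priority=Pmax.
// The function name is hypothetical, not from the Arvados codebase.
func encodePriority(pmax, oldestTopLevelCreatedMs int64) int64 {
	return (pmax << 50) - oldestTopLevelCreatedMs
}

func main() {
	// The bottom container in the diagram: Pmax=2, oldest top-level
	// priority-2 container created at t=1911.
	fmt.Println(encodePriority(2, 1911) == (2<<50)-1911) // true

	// A higher Pmax always wins, regardless of age...
	fmt.Println(encodePriority(2, 1921) > encodePriority(1, 1901)) // true
	// ...and with equal Pmax, the older workflow wins.
	fmt.Println(encodePriority(2, 1911) > encodePriority(2, 1921)) // true
}
```

Because any realistic timestamp fits in the low ~50 bits, the subtraction can never reach into the Pmax bits, which is what guarantees "a container with priority=X has precedence over every container with priority<X".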

#8 Updated by Tom Clegg over 2 years ago

Priority won't be in the 1...1000 range any more. Therefore it seems like we'd need to bundle this with the more complete SLURM scheduling fix we punted during #12573.

#9 Updated by Tom Clegg over 2 years ago

I think something like this would behave reasonably well, where
  • new jobs are submitted with nice=10000
  • config.PriorityElbowRoom is
    • ideally 1000
    • lower (10?) on sites where slurm nice values only go to 10000
    • lower (0?) on sites where Arvados-submitted jobs need to compete with other slurm tenants and it's more important to get priority higher than to minimize scontrol invocations
func reprioritize() {
    wantQueue := []struct {
        ctr      *arvados.Container
        priority int // current slurm priority
        nice     int // current slurm nice value
    }{}

    // Get current set of "our" containers (queued, running, and
    // locked) into wantQueue

    // Call "squeue" and fill in "priority" and "nice" fields of
    // wantQueue

    // Sort wantQueue by ctr.Priority desc, ctr.CreatedAt asc (i.e.,
    // wantQueue[0] should end up with the highest slurm priority)

    // Check & fix:
    var nextPriority int
    for i, ent := range wantQueue {
        adjust := 0    // how much priority should increase
        if i == 0 {
            // Might as well make the highest priority job
            // nice=0.
            adjust = ent.nice
        } else if ent.priority > nextPriority {
            // Current slurm priority is too high.
            adjust = nextPriority - ent.priority
            if i == len(wantQueue)-1 || wantQueue[i+1].priority > nextPriority-1 {
                // We're going to renice the next item
                // anyway (or this is the last item).
                // We might as well make some elbow
                // room so we don't go full N^2.
                adjust -= config.PriorityElbowRoom
            }
        } else if ent.priority < nextPriority-config.PriorityElbowRoom*10 {
            // Too much distance. This ensures we pull
            // newly submitted jobs up to a reasonable
            // priority number, even if they're at the
            // back of the queue.
            adjust = nextPriority - ent.priority - config.PriorityElbowRoom
        }
        if adjust > ent.nice {
            // Can't adjust that much without being root.
            // Just go as far as we can.
            adjust = ent.nice
        }
        if adjust != 0 {
            setSlurmNice(ent.ctr.UUID, ent.nice-adjust)
        }
        nextPriority = ent.priority + adjust - 1
    }
}

func setSlurmNice(ctrUUID string, newNiceValue int) {
    // ...
}

(moved from #12753#note-19)

#10 Updated by Tom Morris over 2 years ago

  • Target version changed from To Be Groomed to 2018-01-31 Sprint
  • Story points set to 2.0

#11 Updated by Tom Clegg over 2 years ago

  • Subject changed from Job scheduler should prioritize based on time workflow is submitted to When priority is equal, the children of an earlier-submitted workflow should run first
  • Description updated (diff)

#12 Updated by Tom Clegg over 2 years ago

  • Assigned To set to Tom Clegg

#13 Updated by Tom Morris over 2 years ago

  • Target version changed from 2018-01-31 Sprint to To Be Groomed

#14 Updated by Tom Morris over 2 years ago

  • Target version changed from To Be Groomed to 2018-02-28 Sprint

#15 Updated by Tom Clegg over 2 years ago

  • Status changed from New to In Progress

#16 Updated by Tom Clegg over 2 years ago

12552-slurm-priority @ 52e26b4e8bbbf505d6641becc435a939cee8c285
  • ensures slurm queue priority order matches arvados priority order
  • sorts on container priority and creation time
  • compatible with (but does not yet implement) api server reporting "workflow priority" (taking into account the highest-priority ancestor) in container response
  • "elbow room" config renamed to PrioritySpread and added to install docs

#17 Updated by Lucas Di Pentima over 2 years ago

Sorry for taking so long to review this. There are some concepts I haven't been able to understand yet, for example slurm's "nice" value: is it used to adjust the priority on the slurm side? If so, what happens to the nice value? Is it set to 0 by slurm once it's used? Is it used the same way as for Linux processes (e.g., greater niceness => lower priority), or the opposite? I suppose the TestReniceChurn test simulates slurm behavior by adding the niceness to the priority, so maybe the greater the nice value is, the less nice the job?

I think the PrioritySpread documentation needs more detail, for example some guidance on which symptoms could be observed when using a non-ideal value, and maybe some examples.

If many scontrol executions are going to happen, could the adjustment time be reduced by executing several commands at once, e.g., in a thread pool or similar? I don't know whether slurm can handle that.

Could it be that the main topic of this story (when priority is equal, the children of an earlier-submitted workflow should run first) is still untested? AFAICT the tests implicitly assume the jobs list is already sorted by the encoded priority/created_at value; maybe it would be worth also testing the encoding part?

#18 Updated by Tom Clegg over 2 years ago

Lucas Di Pentima wrote:

Sorry for taking so long to review this. There are some concepts I haven't been able to understand yet, for example slurm's "nice" value: is it used to adjust the priority on the slurm side? If so, what happens to the nice value? Is it set to 0 by slurm once it's used? Is it used the same way as for Linux processes (e.g., greater niceness => lower priority), or the opposite? I suppose the TestReniceChurn test simulates slurm behavior by adding the niceness to the priority, so maybe the greater the nice value is, the less nice the job?

(priority reported by slurm) = (priority assigned by slurm at submit time) - (nice value)

(priority assigned by slurm at submit time) = some function of submit time/order, slurm configuration, etc., IOW, a bunch of stuff we can't control.

"more nice" = "lower priority".

I think the PrioritySpread documentation needs more detail, for example some guidance on which symptoms could be observed when using a non-ideal value, and maybe some examples.

Seems reasonable. How about:

"If your Arvados containers are waiting too long in the SLURM queue because their "nice" values are too high for them to compete with other SLURM jobs, you should use a lower PrioritySpread value."

If many scontrol executions are going to happen, could the adjustment time be reduced by executing several commands at once, e.g., in a thread pool or similar? I don't know whether slurm can handle that.

I wondered about batching and/or parallelizing update invocations, and although I think slurm should be able to handle it, I'm wary of it being a premature optimization.

Could it be that the main topic of this story (when priority is equal, the children of an earlier-submitted workflow should run first) is still untested? AFAICT the tests implicitly assume the jobs list is already sorted by the encoded priority/created_at value; maybe it would be worth also testing the encoding part?

The thing implemented/tested so far is just reordering the slurm job priorities to match the Arvados container priorities.

The API server update (which isn't done yet) is still needed to make the Arvados priorities obey the rule "when container request priority is equal, children of an earlier-submitted workflow have a higher container priority".

Taken together, these should add up to "user-desired order is communicated to slurm scheduler".

#19 Updated by Lucas Di Pentima over 2 years ago

Thanks for the explanations. I think the additional comment on the documentation will be useful mainly for ops people who are upgrading and see scheduling behaviour changes.

With that, it LGTM.

#20 Updated by Tom Clegg over 2 years ago

  • Target version changed from 2018-02-28 Sprint to 2018-03-14 Sprint

#21 Updated by Tom Clegg over 2 years ago

  • Related to Bug #13166: [node manager] wishlist should consist of top priority containers added

#22 Updated by Tom Clegg over 2 years ago

Aside:

We seem to have a surprising API behavior: When a container_requests#create call is recognized as coming from a running container, the supplied priority is ignored, and the submitter's own priority is used instead -- even if the supplied priority is zero. In other cases, requesting priority=0 results in priority=0 (i.e., if state=Committed, find/create a suitable container, but don't run it). I think we should check for components that are relying on this behavior, and fix it. Meanwhile it looks like it shouldn't affect the current issue.

#23 Updated by Tom Clegg over 2 years ago

12552-wf-priority @ a9065fd962827b2cfae7993b550a386821a08dcb https://ci.curoverse.com/job/developer-run-tests/653/
  • when container request priority is equal, children of an earlier-submitted workflow have a higher container priority
  • when container request priority is higher, all of its children/dependencies also have higher container priority

Container priorities are equal in many cases (e.g., if only one workflow is running, its children all have the same priority). The slurm dispatcher doesn't have an "equal priority" concept so it sorts them next on container UUID to avoid incessant slurm-queue-rearranging.

It would be better to run containers in creation order when priority is equal. Working on that branch next.
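The dispatcher ordering described here (container priority descending, container UUID as a deterministic tiebreaker) can be sketched as below; the type and function names are illustrative:

```go
package main

import (
	"fmt"
	"sort"
)

// Illustrative type; not the actual dispatcher struct.
type ctr struct {
	UUID     string
	Priority int
}

// sortForSlurm orders containers by Arvados priority (descending) and
// breaks ties on container UUID, giving equal-priority containers a
// fixed relative order so the slurm queue isn't constantly rearranged.
func sortForSlurm(q []ctr) {
	sort.Slice(q, func(i, j int) bool {
		if q[i].Priority != q[j].Priority {
			return q[i].Priority > q[j].Priority
		}
		return q[i].UUID < q[j].UUID
	})
}

func main() {
	q := []ctr{{"zzz", 3}, {"aaa", 3}, {"mmm", 7}}
	sortForSlurm(q)
	fmt.Println(q[0].UUID, q[1].UUID, q[2].UUID) // mmm aaa zzz
}
```

Swapping the UUID tiebreaker for a creation-time comparison would give the "run containers in creation order when priority is equal" behavior mentioned as follow-up work.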

#24 Updated by Lucas Di Pentima over 2 years ago

12552-wf-priority LGTM, with just one comment:

  • File services/api/test/unit/container_request_test.rb
    • Lines 328 & 329: I think assert_operator is meant to check parents[n] instead of children[n]

#25 Updated by Tom Clegg over 2 years ago

  • Target version changed from 2018-03-14 Sprint to 2018-03-28 Sprint

#26 Updated by Tom Clegg over 2 years ago

  • Status changed from In Progress to Resolved

#27 Updated by Tom Morris about 2 years ago

  • Release set to 17
