Story #7676

[Crunch] crunch-dispatch reserves the cheapest possible nodes for a job

Added by Brett Smith almost 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
11/23/2015
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
0.5

Description

Sort available nodes by their cost first, number of cores second. Use the first available node(s) in that list.

Where nodes_available_for_job_now calls Node.find_each, generate a list of available nodes and sort them according to the above criteria. Iterate that list with the current loop as-is.


Subtasks

Task #7843: Sort nodes by price before choosingResolvedTom Clegg

Task #7844: Try refactored crunch-dispatch on stagingResolvedTom Clegg

Task #7782: Review 7676-dispatch-cheaper-nodesResolvedTom Clegg


Related issues

Related to Arvados - Story #7711: [Node Manager] Record a node's cloud cost information in the Arvados node record's propertiesResolved11/03/2015

Associated revisions

Revision 0c8cf79d
Added by Tom Clegg over 3 years ago

Merge branch '7676-dispatch-cheaper-nodes' closes #7676

History

#1 Updated by Brett Smith almost 4 years ago

  • Target version changed from 2015-11-11 sprint to Arvados Future Sprints

#2 Updated by Brett Smith almost 4 years ago

  • Subject changed from [Crunch] crunch-dispatch reserves the smallest possible nodes for a job to [Crunch] crunch-dispatch reserves the cheapest possible nodes for a job
  • Description updated (diff)
  • Story points set to 0.5

#3 Updated by Brett Smith almost 4 years ago

  • Target version changed from Arvados Future Sprints to 2015-12-02 sprint

#4 Updated by Brett Smith almost 4 years ago

  • Assigned To set to Tom Clegg

#5 Updated by Tom Clegg almost 4 years ago

7676-dispatch-cheaper-nodes @ e525509 includes:
  • Choose cheaper nodes
  • Move crunch-dispatch code into lib/crunch_dispatch.rb so it can be tested
  • Unit test for "choose cheaper nodes"
  • Startup-and-shutdown test

#6 Updated by Tom Clegg over 3 years ago

  • Status changed from New to In Progress

#7 Updated by Radhika Chippada over 3 years ago

Review comments:

  • crunch_dispatch.rb
    • I focused on updates from 2f7095fe, under the assumption that the move from script to lib didn't involve other significant changes
    • line 163 (if need_procs.select { |node_test| not node_test.call(node) }.any?) — Can you please add a small comment for this that we are checking if the node meets runtime constraints or not. Also, “need_proc” might be a better name for “node_test”
  • crunch_dispatch_test
    • line 7: can you please add a new line here (after the Node.destroy_all) and add a comment such as “create few idle nodes with pricing info …”
  • I could not run the crunch_dispatch_test.rb unit test. (I did rebuild.) But, if you are able to run the test, I am fine with it
  • A question about sorting. Let’s say we have two nodes as follows: ['compute1', 1.60, 32], ['compute2', 1.60, 16] … (that is both compute1 and compute2 have the same price but compute1 has more cores than compute2). And, if I am looking for an 8 core node, would I get “compute2” since it is the smallest that meets my needs? I am wondering about this because otherwise job1 may get compute1 and if job2 needs more than 16 nodes but compute2 does not meet the needs and so we may have wasted resources. (Do we want to sort on price and cores?)

#8 Updated by Tom Clegg over 3 years ago

Radhika Chippada wrote:

  • I focused on updates from 2f7095fe, under the assumption that the move from script to lib didn't involve other significant changes

Right. My intent there was only to move the code, and make the minimum changes necessary for it to work that way. (I tried running it on staging, and that worked.)

  • line 163 (if need_procs.select { |node_test| not node_test.call(node) }.any?) — Can you please add a small comment for this that we are checking if the node meets runtime constraints or not. Also, “need_proc” might be a better name for “node_test”

SGTM, updated.

  • crunch_dispatch_test
    • line 7: can you please add a new line here (after the Node.destroy_all) and add a comment such as “create few idle nodes with pricing info …”

Added.

  • I could not run the crunch_dispatch_test.rb unit test. (I did rebuild.) But, if you are able to run the test, I am fine with it

Which test, and what was the error? The only wild guess I can think of off hand is: does /tmp/crunch_refresh_trigger exist?

  • A question about sorting. Let’s say we have two nodes as follows: ['compute1', 1.60, 32], ['compute2', 1.60, 16] … (that is both compute1 and compute2 have the same price but compute1 has more cores than compute2). And, if I am looking for an 8 core node, would I get “compute2” since it is the smallest that meets my needs? I am wondering about this because otherwise job1 may get compute1 and if job2 needs more than 16 nodes but compute2 does not meet the needs and so we may have wasted resources. (Do we want to sort on price and cores?)

There is definitely potential for making the wrong choice when there are nodes with equal price and different capabilities. However, I don't think this can be solved in the general case without knowing about future jobs: for example, if we have a job with cores=8,mem=8 and two equally expensive nodes {cores=8,mem=16 and cores=16,mem=8} then the best choice depends on whether the next job in the queue will want cores=16,nodes=8 or cores=8,mem=16. I think for the time being we're relying on the premise that when we guess wrong, starting up a new node won't cost much more than using an existing node anyway.

There are also races that make this solution sub-optimal: if we have a 1-cheap-node job and a 2-expensive-node job in the queue, node manager starts up all three nodes, and one of the expensive nodes come alive first, we will dispatch the cheap job to the expensive node because (at that moment) it's the only thing we can do, and the cheapest way to do it. I think we decided we're OK with that outcome, at least for now: although it's far from perfect, with this change we expect to make strictly better decisions than we did before in any given scenario.

#9 Updated by Brett Smith over 3 years ago

Tom Clegg wrote:

There are also races that make this solution sub-optimal: if we have a 1-cheap-node job and a 2-expensive-node job in the queue, node manager starts up all three nodes, and one of the expensive nodes come alive first, we will dispatch the cheap job to the expensive node because (at that moment) it's the only thing we can do, and the cheapest way to do it. I think we decided we're OK with that outcome, at least for now: although it's far from perfect, with this change we expect to make strictly better decisions than we did before in any given scenario.

This is all my understanding, too. This story isn't expected to fix every inefficiency in crunch-dispatch's node selection logic; just make a clear incremental improvement.

#10 Updated by Radhika Chippada over 3 years ago

Tom, those updates and answers are good.

Regarding the test failure: yes, /tmp/crunch_refresh_trigger exists. Below is the error I am seeing. As I said, if it is working for you and there are no undocumented manual steps involved, you can ignore this.

[~/arvados-dev/jenkins → master]$ ./run-tests.sh --leave-temp --temp /tmp/tmp.dP1jU9V6bb WORKSPACE=~/arvados --only services/api services/api_test="TEST=test/unit/crunch_dispatch_test.rb" 

/home/radhika/arvados/services/api/lib/crunch_dispatch.rb:5:in `<class:CrunchDispatch>': uninitialized constant CrunchDispatch::ApplicationHelper (NameError)
    from /home/radhika/arvados/services/api/lib/crunch_dispatch.rb:4:in `<top (required)>'
    from /home/radhika/arvados/services/api/test/unit/crunch_dispatch_test.rb:1:in `require'
    from /home/radhika/arvados/services/api/test/unit/crunch_dispatch_test.rb:1:in `<top (required)>'
    from /home/radhika/.rvm/gems/ruby-2.1.1@arvados-tests/gems/rake-10.2.2/lib/rake/rake_test_loader.rb:15:in `require'
    from /home/radhika/.rvm/gems/ruby-2.1.1@arvados-tests/gems/rake-10.2.2/lib/rake/rake_test_loader.rb:15:in `block in <main>'
    from /home/radhika/.rvm/gems/ruby-2.1.1@arvados-tests/gems/rake-10.2.2/lib/rake/rake_test_loader.rb:4:in `select'
    from /home/radhika/.rvm/gems/ruby-2.1.1@arvados-tests/gems/rake-10.2.2/lib/rake/rake_test_loader.rb:4:in `<main>'
rake aborted!
Command failed with status (1): [ruby -I"lib:test" -I"/home/radhika/.rvm/gems/ruby-2.1.1@arvados-tests/gems/rake-10.2.2/lib" "/home/radhika/.rvm/gems/ruby-2.1.1@arvados-tests/gems/rake-10.2.2/lib/rake/rake_test_loader.rb" "test/unit/crunch_dispatch_test.rb" -v]
/home/radhika/.rvm/gems/ruby-2.1.1@arvados-tests/gems/railties-3.2.17/lib/rails/test_unit/testing.rake:50:in `block in <top (required)>'
/home/radhika/.rvm/gems/ruby-2.1.1@arvados-tests/bin/ruby_executable_hooks:15:in `eval'
/home/radhika/.rvm/gems/ruby-2.1.1@arvados-tests/bin/ruby_executable_hooks:15:in `<main>'
Tasks: TOP => test:single

#11 Updated by Tom Clegg over 3 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 67 to 100

Applied in changeset arvados|commit:0c8cf79d48283ecbe376ab958ad2ac90bbb34e59.

Also available in: Atom PDF