Idea #7676
closed[Crunch] crunch-dispatch reserves the cheapest possible nodes for a job
Added by Brett Smith about 9 years ago. Updated about 9 years ago.
Description
Sort available nodes by their cost first, number of cores second. Use the first available node(s) in that list.
Where nodes_available_for_job_now calls Node.find_each, generate a list of available nodes and sort them according to the above criteria. Iterate that list with the current loop as-is.
Updated by Brett Smith about 9 years ago
- Target version changed from 2015-11-11 sprint to Arvados Future Sprints
Updated by Brett Smith about 9 years ago
- Subject changed from [Crunch] crunch-dispatch reserves the smallest possible nodes for a job to [Crunch] crunch-dispatch reserves the cheapest possible nodes for a job
- Description updated (diff)
- Story points set to 0.5
Updated by Brett Smith about 9 years ago
- Target version changed from Arvados Future Sprints to 2015-12-02 sprint
Updated by Tom Clegg about 9 years ago
- Choose cheaper nodes
- Move crunch-dispatch code into lib/crunch_dispatch.rb so it can be tested
- Unit test for "choose cheaper nodes"
- Startup-and-shutdown test
Updated by Radhika Chippada about 9 years ago
Review comments:
- crunch_dispatch.rb
- I focused on updates from 2f7095fe, under the assumption that the move from script to lib didn't involve other significant changes
- line 163 (if need_procs.select { |node_test| not node_test.call(node) }.any?) — Can you please add a small comment for this that we are checking if the node meets runtime constraints or not. Also, “need_proc” might be a better name for “node_test”
- crunch_dispatch_test
- line 7: can you please add a new line here (after the Node.destroy_all) and add a comment such as “create few idle nodes with pricing info …”
- I could not run the crunch_dispatch_test.rb unit test. (I did rebuild.) But, if you are able to run the test, I am fine with it
- A question about sorting. Let’s say we have two nodes as follows: ['compute1', 1.60, 32], ['compute2', 1.60, 16] … (that is both compute1 and compute2 have the same price but compute1 has more cores than compute2). And, if I am looking for an 8 core node, would I get “compute2” since it is the smallest that meets my needs? I am wondering about this because otherwise job1 may get compute1 and if job2 needs more than 16 nodes but compute2 does not meet the needs and so we may have wasted resources. (Do we want to sort on price and cores?)
Updated by Tom Clegg about 9 years ago
Radhika Chippada wrote:
- I focused on updates from 2f7095fe, under the assumption that the move from script to lib didn't involve other significant changes
Right. My intent there was only to move the code, and make the minimum changes necessary for it to work that way. (I tried running it on staging, and that worked.)
- line 163 (if need_procs.select { |node_test| not node_test.call(node) }.any?) — Can you please add a small comment for this that we are checking if the node meets runtime constraints or not. Also, “need_proc” might be a better name for “node_test”
SGTM, updated.
- crunch_dispatch_test
- line 7: can you please add a new line here (after the Node.destroy_all) and add a comment such as “create few idle nodes with pricing info …”
Added.
- I could not run the crunch_dispatch_test.rb unit test. (I did rebuild.) But, if you are able to run the test, I am fine with it
Which test, and what was the error? The only wild guess I can think of off hand is: does /tmp/crunch_refresh_trigger exist?
- A question about sorting. Let’s say we have two nodes as follows: ['compute1', 1.60, 32], ['compute2', 1.60, 16] … (that is both compute1 and compute2 have the same price but compute1 has more cores than compute2). And, if I am looking for an 8 core node, would I get “compute2” since it is the smallest that meets my needs? I am wondering about this because otherwise job1 may get compute1 and if job2 needs more than 16 nodes but compute2 does not meet the needs and so we may have wasted resources. (Do we want to sort on price and cores?)
There is definitely potential for making the wrong choice when there are nodes with equal price and different capabilities. However, I don't think this can be solved in the general case without knowing about future jobs: for example, if we have a job with cores=8,mem=8 and two equally expensive nodes {cores=8,mem=16 and cores=16,mem=8} then the best choice depends on whether the next job in the queue will want cores=16,nodes=8 or cores=8,mem=16. I think for the time being we're relying on the premise that when we guess wrong, starting up a new node won't cost much more than using an existing node anyway.
There are also races that make this solution sub-optimal: if we have a 1-cheap-node job and a 2-expensive-node job in the queue, node manager starts up all three nodes, and one of the expensive nodes come alive first, we will dispatch the cheap job to the expensive node because (at that moment) it's the only thing we can do, and the cheapest way to do it. I think we decided we're OK with that outcome, at least for now: although it's far from perfect, with this change we expect to make strictly better decisions than we did before in any given scenario.
Updated by Brett Smith about 9 years ago
Tom Clegg wrote:
There are also races that make this solution sub-optimal: if we have a 1-cheap-node job and a 2-expensive-node job in the queue, node manager starts up all three nodes, and one of the expensive nodes come alive first, we will dispatch the cheap job to the expensive node because (at that moment) it's the only thing we can do, and the cheapest way to do it. I think we decided we're OK with that outcome, at least for now: although it's far from perfect, with this change we expect to make strictly better decisions than we did before in any given scenario.
This is all my understanding, too. This story isn't expected to fix every inefficiency in crunch-dispatch's node selection logic; just make a clear incremental improvement.
Updated by Radhika Chippada about 9 years ago
Tom, those updates and answers are good.
Regarding the test failure: yes, /tmp/crunch_refresh_trigger exists. Below is the error I am seeing. As I said, if it is working for you and there are no undocumented manual steps involved, you can ignore this.
[~/arvados-dev/jenkins → master]$ ./run-tests.sh --leave-temp --temp /tmp/tmp.dP1jU9V6bb WORKSPACE=~/arvados --only services/api services/api_test="TEST=test/unit/crunch_dispatch_test.rb" /home/radhika/arvados/services/api/lib/crunch_dispatch.rb:5:in `<class:CrunchDispatch>': uninitialized constant CrunchDispatch::ApplicationHelper (NameError) from /home/radhika/arvados/services/api/lib/crunch_dispatch.rb:4:in `<top (required)>' from /home/radhika/arvados/services/api/test/unit/crunch_dispatch_test.rb:1:in `require' from /home/radhika/arvados/services/api/test/unit/crunch_dispatch_test.rb:1:in `<top (required)>' from /home/radhika/.rvm/gems/ruby-2.1.1@arvados-tests/gems/rake-10.2.2/lib/rake/rake_test_loader.rb:15:in `require' from /home/radhika/.rvm/gems/ruby-2.1.1@arvados-tests/gems/rake-10.2.2/lib/rake/rake_test_loader.rb:15:in `block in <main>' from /home/radhika/.rvm/gems/ruby-2.1.1@arvados-tests/gems/rake-10.2.2/lib/rake/rake_test_loader.rb:4:in `select' from /home/radhika/.rvm/gems/ruby-2.1.1@arvados-tests/gems/rake-10.2.2/lib/rake/rake_test_loader.rb:4:in `<main>' rake aborted! Command failed with status (1): [ruby -I"lib:test" -I"/home/radhika/.rvm/gems/ruby-2.1.1@arvados-tests/gems/rake-10.2.2/lib" "/home/radhika/.rvm/gems/ruby-2.1.1@arvados-tests/gems/rake-10.2.2/lib/rake/rake_test_loader.rb" "test/unit/crunch_dispatch_test.rb" -v] /home/radhika/.rvm/gems/ruby-2.1.1@arvados-tests/gems/railties-3.2.17/lib/rails/test_unit/testing.rake:50:in `block in <top (required)>' /home/radhika/.rvm/gems/ruby-2.1.1@arvados-tests/bin/ruby_executable_hooks:15:in `eval' /home/radhika/.rvm/gems/ruby-2.1.1@arvados-tests/bin/ruby_executable_hooks:15:in `<main>' Tasks: TOP => test:single
Updated by Tom Clegg about 9 years ago
- Status changed from In Progress to Resolved
- % Done changed from 67 to 100
Applied in changeset arvados|commit:0c8cf79d48283ecbe376ab958ad2ac90bbb34e59.