Project

General

Profile

Actions

Bug #13649

closed

c-d-s doesn't request a preemptible instance when it should

Added by Lucas Di Pentima over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
API
Target version:
Story points:
-
Release:
Release relationship:
Auto

Description

When a CR having the preemptible=true scheduling parameter set gets picked up by crunch-dispatch-slurm, it seems that the preemptible parameter gets ignored because c-d-s runs sbatch with the non-preemptible version of the instance type.


Subtasks 2 (0 open2 closed)

Task #13653: Review 13649-preemptible-committed-child-crResolvedPeter Amstutz06/21/2018Actions
Task #13682: Review 13649-spot-instances-docResolvedNico César06/26/2018Actions

Related issues 1 (0 open1 closed)

Related to Arvados - Idea #7478: [Node Manager] Creates compute nodes using AWS spot instancesResolvedLucas Di Pentima05/25/2018Actions
Actions #1

Updated by Lucas Di Pentima over 6 years ago

  • Related to Idea #7478: [Node Manager] Creates compute nodes using AWS spot instances added
Actions #2

Updated by Lucas Di Pentima over 6 years ago

  • Status changed from New to In Progress
Actions #3

Updated by Lucas Di Pentima over 6 years ago

Updates at b3f6c1442 - branch 13649-cds-preemptible-sbatch

This issue seems to not be related to a crunch-dispatch-slurm bug, after some more investigation I saw that although the container request included the preemptible scheduling parameter, its container didn't get it, so it seems that this is the reason for c-d-s to not be able to see it.
The issue is related to the other bugfix at branch 7478-auto-preemptible-cr-fix, having to do with the order that some callbacks are called: set_container is set to be run before the callback that set the default scheduling parameters value, so the newly created container copies an empty scheduling_parameters field from its request.

On this branch I first merged 7478-auto-preemptible-cr-fix because it solves a similar issue, and then wrote a test that exposes the bug. I'm not sure which is the proper way to solve this, I tried reordering the callbacks, but lots of tests fail.

Actions #4

Updated by Lucas Di Pentima over 6 years ago

Updates at 0e94528878341faa941247eb3d990230c941868c - branch 13649-preemptible-committed-child-cr (rebased, renamed branch)
Test run: https://ci.curoverse.com/job/developer-run-tests/773/

Changed the entire approach, reverted the set_requesting_container_uuid callback to be called on before_create, but added a private method that returns the requesting container even though it isn't assigned to the CR (yet) so it can be used from another callback before it's assigned.
Moved set_default_preemptible_scheduling_parameter to be run after set_container so that the resolved container gets the correct scheduling parameters.

Actions #5

Updated by Peter Amstutz over 6 years ago

Lucas Di Pentima wrote:

Updates at 0e94528878341faa941247eb3d990230c941868c - branch 13649-preemptible-committed-child-cr (rebased, renamed branch)
Test run: https://ci.curoverse.com/job/developer-run-tests/773/

Changed the entire approach, reverted the set_requesting_container_uuid callback to be called on before_create, but added a private method that returns the requesting container even though it isn't assigned to the CR (yet) so it can be used from another callback before it's assigned.
Moved set_default_preemptible_scheduling_parameter to be run after set_container so that the resolved container gets the correct scheduling parameters.

Couple of small testing suggestions:

assert_not_nil cr.requesting_container_uuid

Should probably be

assert_equal 'zzzzz-dz642-runningcontainr', cr.requesting_container_uuid

You added two tests, but they appear to be nearly identical. Could you merge them into one test?

Rest LGTM.

Actions #6

Updated by Lucas Di Pentima over 6 years ago

Branch 13649-spot-instances-doc - e9e440847c574d1152b128cfd508cf473f49121b

Documentation page on Admin section that describes the required configuration changes to be able to use AWS Spot instances.

Actions #7

Updated by Lucas Di Pentima over 6 years ago

For cost tracking, AWS offers a spot instance data feed that creates hourly logs on an S3 bucket: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-data-feeds.html

I've done some tests with t2.micro instances (price being $0.0116 for on-demand instances), asked t2.micro spot instances with max_price=0.005 (price history on us-west regions show 0.0035 to be the lowest price) and got this tab separated values file on the S3 bucket:

#Version: 1.0
#Fields: Timestamp UsageType Operation InstanceID MyBidID MyMaxPrice MarketPrice Charge Version
2018-06-25 19:45:04 UTC    USW2-SpotUsage:t2.micro    RunInstances:SV001    i-0fc4474c7952e98d4    sir-qqci9gxn    0.005 USD    0.004 USD    0.004 USD    1
Actions #8

Updated by Tom Morris over 6 years ago

I thought we said that we were going to bid the default (ie on-demand) price rather than trying to figure out any exotic bidding strategy. We get the spot market price even if we bid "too high."

Actions #9

Updated by Lucas Di Pentima over 6 years ago

Tom Morris wrote:

I thought we said that we were going to bid the default (ie on-demand) price rather than trying to figure out any exotic bidding strategy. We get the spot market price even if we bid "too high."

Yes, that's what we're doing. The price data feed is to know the actual price that Amazon's charges for every instance after every running hour, so that we can do cost analysis on our tests.

Actions #10

Updated by Nico César over 6 years ago

review at e9e440847c574d1152b128cfd508cf473f49121b

the phrase
"If attempting to use this feature with an already working AWS account and getting an error message on nodemanager's log like this one:"

sounds long and confusing, maybe "If this is the case check nodemanager's log for:" is better?

Changing the "you" reference: "...linked role created. For this, you can log into your AWS account, go to " by "This can be done by logging int the AWS account..."

besides that small change I feel that the documentation should be ready to merge

Actions #11

Updated by Lucas Di Pentima over 6 years ago

  • Status changed from In Progress to Resolved
Actions #12

Updated by Tom Morris over 6 years ago

  • Release set to 13
Actions

Also available in: Atom PDF