Project

General

Profile

Actions

Feature #19320

open

Get actual instance price information by calling AWS APIs

Added by Peter Amstutz 6 months ago. Updated 3 days ago.

Status:
In Progress
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Start date:
01/23/2023
Due date:
% Done:

0%

Estimated time:
(Total: 0.00 h)
Story points:
2.0

Description

This data structure would be populated at a configurable `ContainerCostUpdateInterval` (default every 2 minutes?) for running containers, as well as immediately at the start and the end of a container.

For AWS spot instances, we would have a goroutine that stores the relevant spot price history (DescribeSpotPriceHistory API call) for the configured availability zone in the database. Awkwardly we currently derive the availability zone from `Containers/CloudVMs/DriverParameters/SubnetID`, perhaps we should add an explicit value in the config file). The history would only be stored up to the oldest running job (and maybe while nothing is running, up to the current point in time?). It could be refreshed on the same frequency as `ContainerCostUpdateInterval`, but ideally we would stop updating when nothing is running.

For regular AWS instances, we can get the hourly price for all node types in our AZ, API call TBD. This could be updated once a day (configurable?).

For both spot and regular AWS instances, we also need to account for the cost of any extra attached EBS storage. API call TBD.

As a bonus, or a follow-on, it would also be useful for the container request record to contain:

  • the total cost to run the process tree (sum of the container and all child containers)
  • the incremental cost to run the process tree (this is the previous sum excluding container requests that reused existing containers)

This could be calculated by the API server or controller and added to the container request (maybe as a property) so that it is easily available for display in Workbench.

Follow on work, outside the scope of this ticket:
  • Azure support for regular and spot instances
  • Costanalyzer support to use this cost data
  • Workbench2 support to show this cost data live
  • Gather interesting statistics for spot instances that could inform scheduling decisions in the future (e.g. how frequently certain node types get evicted, or the average spot price by instance type over time)

Subtasks 1 (1 open0 closed)

Task #19905: Review 19320-spot-pricingIn ProgressTom Clegg01/23/2023

Actions

Related issues

Related to Arvados Epics - Story #18179: Better spot instance supportIn Progress03/01/202204/30/2023

Actions
Blocked by Arvados - Feature #18205: [api] [cloud] add compute instance price to container recordResolvedTom Clegg08/08/2022

Actions
Actions #1

Updated by Peter Amstutz 6 months ago

  • Blocked by Feature #18205: [api] [cloud] add compute instance price to container record added
Actions #2

Updated by Peter Amstutz 6 months ago

  • Description updated (diff)
  • Subject changed from Get actual instance price information by calling cloud APIs to Get actual instance price information by calling AWS APIs
Actions #3

Updated by Peter Amstutz 6 months ago

  • Target version changed from 2022-09-14 sprint to 2022-08-31 sprint
Actions #4

Updated by Peter Amstutz 6 months ago

  • Target version changed from 2022-08-31 sprint to 2022-09-14 sprint
Actions #5

Updated by Peter Amstutz 5 months ago

  • Target version changed from 2022-09-14 sprint to 2022-09-28 sprint
Actions #6

Updated by Peter Amstutz 5 months ago

  • Target version changed from 2022-09-28 sprint to 2022-10-12 sprint
Actions #7

Updated by Peter Amstutz 4 months ago

  • Target version deleted (2022-10-12 sprint)
Actions #8

Updated by Peter Amstutz 25 days ago

  • Target version set to 2023-01-18 sprint
Actions #9

Updated by Peter Amstutz 25 days ago

  • Story points set to 2.0
Actions #10

Updated by Peter Amstutz 25 days ago

  • Assigned To set to Tom Clegg
Actions #11

Updated by Tom Clegg 21 days ago

  • Status changed from New to In Progress
Actions #13

Updated by Peter Amstutz 12 days ago

  • Target version changed from 2023-01-18 sprint to 2023-02-01 sprint
Actions #14

Updated by Tom Clegg 10 days ago

Example test run using the ec2 driver's live mode (see top of source:lib/cloud/ec2_test.go):

What next? test lib/cloud/ec2 -check.vv -live-ec2-cfg /tmp/ec2config.yml -check.f=PriceHistory
======= test lib/cloud/ec2
START: ec2_test.go:286: EC2InstanceSetSuite.TestInstancePriceHistory
instances are running, and identifiable as spot instances
i-0aaebc82dbb7994a2 price history: [{2023-01-14 17:40:14 +0000 UTC 0.0035}]
i-0423b0e8cbd6f68bd price history: [{2023-01-14 17:40:14 +0000 UTC 0.0035}]
i-027d69380093cf500 price history: [{2023-01-14 17:40:14 +0000 UTC 0.0035}]
i-0eb8ac5e664ead7f1 price history: [{2023-01-14 17:40:14 +0000 UTC 0.0035}]
destroy instance i-0423b0e8cbd6f68bd
destroy instance i-027d69380093cf500
destroy instance i-0eb8ac5e664ead7f1
PASS: ec2_test.go:286: EC2InstanceSetSuite.TestInstancePriceHistory     17.139s

OK: 1 passed
PASS
ok      git.arvados.org/arvados.git/lib/cloud/ec2       17.152s
======= test lib/cloud/ec2 -- 19s
Pass: lib/cloud/ec2 tests (19s)
All test suites passed.
Implementation overview:
  • cloud.Instance interface has a new PriceHistory method, which returns the known price changes for the instance (or nil if not supported -- currently only supported by ec2 driver)
  • dispatcher sends each instance's current PriceHistory to crunch-run during "run probe" (crunch-run --list)
  • crunch-run --list writes the price history to a file in lockdir and sends SIGUSR2 to all crunch-run processes when checking aliveness (it used to send signal 0)
  • while running a container, crunch-run catches SIGUSR2 and loads the latest price history
  • when calculating cost to save in the container record, crunch-run uses the loaded price history if any, otherwise falls back to previous behavior (json in InstanceType env var)
  • ec2 driver's "list instances" implementation notices when spot instances are in use, looks up their availability zones (this uses a separate AWS API call), and looks up recent spot price data for the relevant instance types in those zones
  • ec2 driver parameter SpotPriceUpdateInterval (default 24h) is implemented as "if we already looked up pricing for all of the instance types/zones currently running, don't refresh; otherwise, refresh pricing for all running types/zones"
  • there is no attempt to calculate cost of AddedScratch (where is the best place to document this?)
Actions #15

Updated by Tom Clegg 7 days ago

One more feature:
  • Log entries in crunch-run.txt whenever the instance price changes, to provide feedback that the price checker is doing something, and make it possible to see how the total cost was calculated.

19320-spot-pricing @ c138f58b21edb574b101588f6fc61dce8a98ed3e -- developer-run-tests: #3453

Actions #16

Updated by Lucas Di Pentima 6 days ago

Questions related to lib/cloud/ec2/ec2.go:
  • Line 248:
    • Would it make sense to get the default interval value from the config? I'm thinking that it could get out of sync if we ever change it and forget to update this.
    • I think it could be useful to log a warning message when correcting this interval setting.
    • Also, would it be healthy to prevent too many API calls in a short period of time because of a misconfiguration? Does it make sense to allow this interval to be smaller than say, 5 minutes?
  • Line 322: The priceKey struct includes a 'spot' boolean member, that seems to always be true. Is this for future proofing?
  • Line 355:
    • Should that append() call be inside the previous conditional? We are only interested on instance type prices with stale information, right?
    • How about duplicated instance types? If we have many hundreds of spot instances of the same type, the filter might get a bit too large for the API call?

Just in case you missed it: There are additional features listed on this ticket like regular instance prices and extra costs related to attached storages that weren't addressed on this branch.

Actions #17

Updated by Tom Clegg 5 days ago

Lucas Di Pentima wrote in #note-16:

  • Would it make sense to get the default interval value from the config? I'm thinking that it could get out of sync if we ever change it and forget to update this.

The hard-coded default made tests work without specifying anything, but IRL config.default.yml already makes the default 24h, so it's only 0 if someone wrote that in the config. I've updated this so 0 now means "don't look up spot prices at all", which seems more useful than being able to write "0" as an alternate spelling of "24h".

  • I think it could be useful to log a warning message when correcting this interval setting.

On that note, I'm inclined to not log anything when it's been explicitly disabled by writing 0 -- thoughts?

  • Also, would it be healthy to prevent too many API calls in a short period of time because of a misconfiguration? Does it make sense to allow this interval to be smaller than say, 5 minutes?

I'm not sure about this. If someone wants to check every 30 seconds, who am I to say they shouldn't?

  • Line 322: The priceKey struct includes a 'spot' boolean member, that seems to always be true. Is this for future proofing?

Yes. I've added a comment at the lookup part to highlight that (currently) looking up a non-spot instance will never return anything.

  • Line 355:
    • Should that append() call be inside the previous conditional? We are only interested on instance type prices with stale information, right?

I could be persuaded otherwise, but my thinking here was that if we're going to do an API call to get prices, and that API lets us specify many instance types, we might as well get the latest prices for all of the types we're running. This way we do ~1 request when we start using a new instance type, plus 1 per specified interval. The other obvious way (update only the stale data) would have us doing multiple API calls in each interval even when there are no new instance types, because the cached data points' expiry times would start out staggered and never sync up.

If we're limiting the total number of API calls to less than { 1 per instance creation + 1 per interval }, and the cost of the price-list API call (even with lots of instance types) is much less than the cost of creating an instance, I'm thinking this is optimized enough. Does that sound right?

  • How about duplicated instance types? If we have many hundreds of spot instances of the same type, the filter might get a bit too large for the API call?

Oops, good catch, I forgot to deduplicate these. Fixed.

Just in case you missed it: There are additional features listed on this ticket like regular instance prices and extra costs related to attached storages that weren't addressed on this branch.

I've added costs for AddedScratch EBS volumes. I'm inclined to get what we've got so far finished/merged before wading into the non-spot prices, though -- does that sound OK?

(will update branch when I find out why live tests are failing.)

Actions #18

Updated by Tom Clegg 5 days ago

  • Related to Story #18179: Better spot instance support added
Actions #20

Updated by Lucas Di Pentima 4 days ago

Tom Clegg wrote in #note-17:

On that note, I'm inclined to not log anything when it's been explicitly disabled by writing 0 -- thoughts?

Yes, if the software doesn't make automatic corrections, no logging needed.

I'm not sure about this. If someone wants to check every 30 seconds, who am I to say they shouldn't?

Fair enough. I just thought that a potential typo of using for example a value of 6000 (as times are sometimes expressed in seconds) would produce API throttling situation on the dispatcher because it's making requests every 6 millisecs.

If we're limiting the total number of API calls to less than { 1 per instance creation + 1 per interval }, and the cost of the price-list API call (even with lots of instance types) is much less than the cost of creating an instance, I'm thinking this is optimized enough. Does that sound right?

Yeah, that's a nice optimization I didn't get when reading the code. Thanks for clarifying it.

I've added costs for AddedScratch EBS volumes. I'm inclined to get what we've got so far finished/merged before wading into the non-spot prices, though -- does that sound OK?

Sounds great! This LGTM, please merge.

Actions #21

Updated by Tom Clegg 3 days ago

I just thought that a potential typo of using for example a value of 6000 (as times are sometimes expressed in seconds) would produce API throttling situation on the dispatcher because it's making requests every 6 millisecs.

Ah, yes. Specifying 6000 is already an error -- we don't accept unitless non-zero numbers as arvados.Duration values for exactly that reason.

Actions

Also available in: Atom PDF