Story #9623

[Crunch2] [API] Reuses Containers to satisfy ContainerRequests

Added by Brett Smith almost 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
API
Target version:
Start date:
09/09/2016
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
2.0
Release:
Release relationship:
Auto

Description

In ContainerRequest#resolve, before creating a new Container, find out whether a Container already exists with
  • Exactly the same [command, cwd, environment, output_path, container_image, mounts, runtime_constraints] as the Container we're about to create
  • state in [Queued, Locked, Running, Complete]
  • (if state=Complete) exit_code==0
If multiple matching containers are found, take the first of:
  • The most recent container with state=Complete
  • The best Running container (order by progress desc, started_at asc -- IOW, most likely to finish soonest)
  • The best Locked container (order by priority desc, created_at asc -- IOW, most likely to start soonest)
  • The best Queued container (order by priority desc, created_at asc -- IOW, most likely to start soonest)
In order to efficiently compare equality of serialized fields like mounts, runtime_constraints, and environment, we must:
  • Recursively sort hash keys before putting them in the database. This can be done in a Container before_save hook.
  • Recursively sort hash keys before using them in a query like "where mounts=?". This must be done during ContainerRequest#resolve. (The keys do not need to be sorted in the ContainerRequest's serialized attrs.)
For now, we will not implement:
  • A mechanism allowing a client to skip re-use and force creation of a new container

Subtasks

Task #9999: Update Containers related documentation to explain how container reuse works.ResolvedRadhika Chippada

Task #9906: Review 9623-reuse-containersResolvedRadhika Chippada

Associated revisions

Revision 6430b728
Added by Lucas Di Pentima over 3 years ago

Merge branch '9623-reuse-containers'
Closes #9623

History

#1 Updated by Tom Clegg almost 4 years ago

  • Category set to API
  • Assigned To set to Tom Clegg

#2 Updated by Tom Clegg almost 4 years ago

  • Description updated (diff)

#3 Updated by Tom Clegg almost 4 years ago

  • Description updated (diff)

#4 Updated by Tom Clegg almost 4 years ago

  • Description updated (diff)

#5 Updated by Tom Morris almost 4 years ago

  • Target version set to Arvados Future Sprints
  • Story points set to 2.0

#6 Updated by Tom Clegg almost 4 years ago

  • Assigned To deleted (Tom Clegg)

#7 Updated by Tom Morris almost 4 years ago

  • Target version changed from Arvados Future Sprints to 2016-09-14 sprint

#8 Updated by Lucas Di Pentima almost 4 years ago

  • Assigned To set to Lucas Di Pentima

#9 Updated by Lucas Di Pentima almost 4 years ago

  • Status changed from New to In Progress

#10 Updated by Lucas Di Pentima over 3 years ago

f137e22a65cc29d00146a4735fe54d72b299984b

Complete test suite being run @ https://ci.curoverse.com/job/developer-test-job/191/

Container reuse according to the described criteria.
Tests written to cover the cases when several candidates exist for a single request.
Corrected some preexisting test to be compatible with this new behaviour.

#11 Updated by Radhika Chippada over 3 years ago

  • As Tom clarified in IRC, please split the big SQL query in find_reusable into a few as below
    • For state “Complete” : “select state=Complete group by output limit 2”. If you get 2, it means we have two different outputs and we do not want to reuse either. If you got just 1, do another query “select where state=Complete order by finished_at limit 1” and use this oldest record.
    • If no containers in Complete state (none or can’t reuse), get all containers in state = Running order by progress desc, started_at asc limit 1 and use it
    • Otherwise, get containers in state [Locked, Queued] order by state, priority desc, created_at asc, limit 1 and use it
  • It appears that the deep_sort_hash is copied over from job.rb. Please consider moving it to arvados_base from job instead of copying
  • Please remove output from other containers fixtures that are in non-Complete state
  • container_test.rb
    • You do not need to_json in these asserts in test “assert_equal c.environment.to_json, Container.deep_sort_hash(env).to_json”
    • Progress is between 0.0 to 1.0, so it would be nice if you change progress 15.0 to 0.15 etc
    • The test “find_reusable method should select running container most likely to finish sooner” has same progress for both first and second. Can you please add a test where progress is higher and that it gets selected
    • Can you add another test like “find_reusable method should not select failed container” where there is one failed and one running (or any other state you like) and verify that the other gets selected?
  • Ideally these new tests in container_test.rb could have been in container_request_test.rb so that we can see the reuse logic working as desired. Can you at least add a couple tests in container_request_test such that:
    • CR1 in Committed state; CR2 in Uncommitted state + assert container_uuid is nil; update CR2 to Committed state and assert CR2.container_uuid == CR1.container_uuid
    • CR1 in Committed state; CR2 with slightly different attributes in Uncommitted state + assert container_uuid is nil; update CR2 to Committed state and assert CR2.container_uuid != CR1.container_uuid

#12 Updated by Lucas Di Pentima over 3 years ago

Previous corrections addressed at a02b4f41cb495ecfea49ac0da1f6da0aec385f7d

Added an additional test to check that no completed container is selected if there are more than 1 different outputs.

Regarding the deep_sort_hash(env).to_json comment: If we just compare hashes and not serialized hashes, the assert will be successful even if the hashes are sorted differently, that's why they're serialized on the assertion.

Complete test suite being run @ https://ci.curoverse.com/job/developer-test-job/193/

#13 Updated by Radhika Chippada over 3 years ago

at a02b4f4

  • find_reusable
    • The "completed" SQL is pretty awesome :)
    • The completed.order should be “finished_at asc” so that we fetch the “oldest” completed CR
    • You can please accomplish the “Locked” and “Queued” state handling in one query using order by state asc
  • In test "Container request #{(env1 == env2) ? 'does' : 'does not'} reuse container when committed" , please add a not_nil assert on cr1.container_uuid at line 416 (or check cr2.container_uuid is not_nil after line 422)
  • In test "find_reusable method should not select completed container when inconsistent outputs exist”, you can call c_older and c_rerent as c_output1 and c_output2 instead
  • Nit: If you don’t mind, please replace the “winner” with “selected or chosen container” in the tests?

Thanks.

#14 Updated by Lucas Di Pentima over 3 years ago

Comments addressed at 3c69428
Thanks for the compliment! :)

#15 Updated by Radhika Chippada over 3 years ago

LGTM

#16 Updated by Lucas Di Pentima over 3 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 50 to 100

Applied in changeset arvados|commit:6430b728cbce69b56adc01432fc9351088724efd.

Also available in: Atom PDF