Story #6429

[API] [Crunch2] Implement "containers" and "container requests" tables, models and controllers

Added by Peter Amstutz over 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Peter Amstutz
Category:
Crunch
Target version:
Start date:
12/03/2015
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
3.0

Description

See Containers API and Crunch2 migration

For this initial implementation, focus on implementing and testing a minimal feature set establishing the life cycle of Container and ContainerRequest objects, thus enabling us to develop and test other Crunch2 components.

Implementation notes

Implement validations/restrictions and methods at the model layer, not in controller methods.

Tests

Access control
  • Non-admin users cannot modify Container records
  • Non-admin users cannot modify the container_uuid of a ContainerRequest, except by setting it to null when state==Uncommitted.
State changes and validations
  • ContainerRequest state defaults to Uncommitted.
  • ContainerRequest state transition ∈ {Uncommitted→Committed, Committed→Final} -- any other transition causes an error.
  • ContainerRequest cannot be modified by anyone if state=Final.
  • ContainerRequest cannot be modified if state=Committed -- except priority, container_uuid, and container_count_max.
  • Container state ∈ {Queued, Running, Cancelled, Failed, Complete}.
  • Container state transition ∈ {Queued→Running, Queued→Cancelled, Running→Cancelled, Running→Failed, Running→Complete} -- any other transition causes an error.
  • properties attribute must be a Hash, etc. (use existing patterns for handling serialized attributes)
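The allowed transitions above can be sketched as plain lookup tables (a standalone illustration with hypothetical names, not the actual Rails model code):

```ruby
# Hypothetical sketch of the transition rules described above. The real
# implementation lives in the Rails models; this just illustrates the
# allowed edges as lookup tables. Staying in the same state (a no-op
# update) is treated as always permitted.

CONTAINER_TRANSITIONS = {
  "Queued"  => ["Running", "Cancelled"],
  "Running" => ["Cancelled", "Failed", "Complete"],
}

REQUEST_TRANSITIONS = {
  "Uncommitted" => ["Committed"],
  "Committed"   => ["Final"],
}

def transition_allowed?(table, from, to)
  from == to || (table[from] || []).include?(to)
end
```

For example, `transition_allowed?(CONTAINER_TRANSITIONS, "Queued", "Complete")` is false, matching the rule that any transition not in the list causes an error.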
Creating and stopping Container objects
  • When a ContainerRequest's state changes from Uncommitted→Committed (regardless of priority), create a new Container populated with the relevant ContainerRequest attributes. Store the new Container's UUID as the ContainerRequest's container_uuid.
  • When a ContainerRequest changes priority (which means it is Committed and has a container), update the Container's priority to the max priority of any ContainerRequest with the corresponding container_uuid.
  • When a ContainerRequest cancels or changes priority to zero, and this leaves its Container running but with priority=0, cancel the container.
  • When changing Container state away from Running, cancel all active ContainerRequests that were initiated by this Container (see requesting_container_uuid).
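The priority bookkeeping above can be sketched in plain Ruby (a hypothetical illustration using a simple Struct; the assumption that only Committed requests count toward the max is mine):

```ruby
# Hypothetical sketch: a container's priority is the max priority of the
# committed requests pointing at it, and a running container whose
# priority drops to 0 should be cancelled.

Request = Struct.new(:state, :priority, :container_uuid)

def container_priority(requests, container_uuid)
  requests
    .select { |r| r.state == "Committed" && r.container_uuid == container_uuid }
    .map(&:priority)
    .max || 0
end

def should_cancel?(container_state, priority)
  container_state == "Running" && priority == 0
end
```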
Not needed for this initial implementation:
  • Locking mechanism for running containers. We will certainly need it for production, but it isn't included in this increment. For now we'll accept the limitations, like a maximum of one dispatcher running at a time.
  • Re-use an existing container that meets the criteria of a ContainerRequest. For now every ContainerRequest that gets committed will result in a new Container being created.
  • Check whether repositories and inputs are readable by the submitter.
  • Dispatch/execute the containers that get created. For now all containers will just stay queued, because there is no component to dispatch them yet.
  • API documentation. The feature doesn't work yet, so we don't need to advertise it.
  • expires_at, filters, container_count_max (for now these must be null/empty)
  • validation of runtime_constraints, container_image, environment, cwd, etc.

Subtasks

Task #7910: Write tests (Resolved, Peter Amstutz)

Task #7911: Review 6429-crunch2-api (Resolved, Peter Amstutz)

Task #7909: Implement models & controllers (Resolved, Peter Amstutz)


Related issues

Related to Arvados - Story #6282: [Crunch] Write stories for implementation of Crunch v2 (Resolved, 06/23/2015)

Related to Arvados - Story #8001: [Draft] [Crunch2] arv-container to run containers (Duplicate)

Related to Arvados - Feature #6519: [Workbench] [Crunch2] index and show pages for Containers and Container Requests (Resolved, 07/08/2015)

Blocks Arvados - Feature #6518: [Crunch] [Crunch2] Dispatch containers via slurm (Resolved, 07/08/2015)

Blocks Arvados - Feature #6520: [Node Manager] [Crunch2] Take queued containers into account when computing how many nodes should be up (Resolved, 07/08/2015)

Blocks Arvados - Feature #7816: [Crunch2] Execute minimal container spec with logging (Resolved, 11/17/2015)

Associated revisions

Revision a1adf1ed
Added by Peter Amstutz almost 4 years ago

Merge branch '6429-crunch2-api' closes #6429

Revision e74e37c4 (diff)
Added by Peter Amstutz almost 4 years ago

Small test cleanup. refs #6429

Revision 4e5342f4 (diff)
Added by Peter Amstutz almost 4 years ago

Cleanup of test cleanup. refs #6429

Revision d71335c6 (diff)
Added by Peter Amstutz almost 4 years ago

Add required indexes to containers and container_requests tables. refs #6429

Revision 08ee1fa8
Added by Peter Amstutz almost 4 years ago

Merge branch '6429-fix-container-indexes' refs #6429

History

#1 Updated by Peter Amstutz over 4 years ago

  • Subject changed from [API] [Crunch] Implement "containers" and "container requests" tables, models and controllers to [API] [Crunch v2] Implement "containers" and "container requests" tables, models and controllers

#2 Updated by Tom Clegg over 4 years ago

  • Project changed from Arvados Private to Arvados
  • Subject changed from [API] [Crunch v2] Implement "containers" and "container requests" tables, models and controllers to [API] [Crunch2] Implement "containers" and "container requests" tables, models and controllers
  • Category set to API

#3 Updated by Tom Clegg over 4 years ago

  • Description updated (diff)
  • Story points set to 2.0

#4 Updated by Tom Clegg about 4 years ago

  • Description updated (diff)

#5 Updated by Tom Clegg about 4 years ago

  • Description updated (diff)

#6 Updated by Brett Smith about 4 years ago

  • Story points changed from 2.0 to 3.0

#7 Updated by Peter Amstutz about 4 years ago

  • Category changed from API to Crunch
  • Status changed from New to In Progress
  • Assigned To set to Peter Amstutz

#8 Updated by Peter Amstutz about 4 years ago

A few thoughts:

How do we handle failure? Is the dispatcher required to retry containers that fail, or is the dispatcher a "best effort" service and the API decides to retry by scheduling a new container?

Currently the container_uuid field only holds a single container_uuid at a time. If the API schedules a new container, does that mean any container requests associated with that container get updated with the new container?

If the container_uuid field only holds one container at a time, and containers don't link back to the container requests that created them, then we don't have a record of past attempts to fulfill this request. This means we don't have anything to check against container_count_max. A few possible solutions:

  • Make container_uuid an array of containers created to fulfill a given container request (this introduces complexity)
  • Decrement container_count_max on the request when submitting a new container
  • Compute content address of the container request and discover containers with that content address. This would conflict with "no reuse" or "impure" requests which are supposed to ignore past execution history. Could solve this by salting the content address with a timestamp; "no reuse" containers would never ever be reusable which might be fine.
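The salting idea in the last bullet could look roughly like this (a hypothetical sketch; the field set, SHA-256, and the timestamp salt are all assumptions, not anything implemented here):

```ruby
require 'digest'
require 'json'

# Hypothetical sketch: compute a content address for a container request
# from the attributes that determine its output. Keys are sorted so the
# address is independent of hash ordering. For "no reuse" requests, mix
# in a timestamp salt so the address never matches any past container.
def request_content_address(attrs, reuse: true)
  salt = reuse ? "" : Time.now.to_f.to_s
  Digest::SHA256.hexdigest(JSON.generate(attrs.sort.to_h) + salt)
end
```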

Also:

I think we should distinguish between infrastructure failure and task failure by distinguishing between "TempFail" and "PermFail" in the container state. "TempFail" shouldn't count against the container_count_max count, or alternatively we only honor container_count_max for "TempFail" tasks and don't retry "PermFail".

Ideally, "TempFail" containers should retry forever, but with a backoff. One way to do the backoff is to schedule the container to run at a specific time in the future.

Having a field specifying "wait until time X to run this container" would be generally useful for cron-style tasks.
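A minimal sketch of such a backoff, assuming a made-up base delay and cap (the "run no earlier than" field itself doesn't exist yet):

```ruby
# Hypothetical sketch: exponential backoff for rescheduling a
# TempFail-style container. Attempt 1 -> 60s, 2 -> 120s, 3 -> 240s,
# capped at one hour. The resulting time would go in a "don't run
# before time X" field on the container.
def next_run_at(attempt, now: Time.now, base: 60, cap: 3600)
  delay = [base * (2 ** (attempt - 1)), cap].min
  now + delay
end
```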

#9 Updated by Peter Amstutz about 4 years ago

Also missing: a "parent container" or "parent container request" concept to build process trees.

#10 Updated by Peter Amstutz about 4 years ago

To elaborate, suggest the container request have an optional "parent_container_uuid" field. The process tree can then be built by following "container -> request -> container -> request -> ..."

#11 Updated by Peter Amstutz about 4 years ago

Assuming the above notes are not part of this story (for the development spike of getting something working end to end as quickly as possible we can probably ignore error handling for now), we will need a follow-on story to hash out and implement these details.

#12 Updated by Tom Clegg about 4 years ago

Also missing: a "parent container" or "parent container request" concept to build process trees.

This is what requesting_container_uuid is for, if I get your drift. E.g., "cancel all active ContainerRequests that were initiated by this Container" in the issue description.

I think we should distinguish between infrastructure failure and task failure by distinguishing between "TempFail" and "PermFail" in the container state.

I disagree. Once a container stops, it stays stopped; there is nothing temporary about it. A client's decision about whether to start a new container B after container A fails is not a property of Container A.

Deciding how much a given Container counts toward a ContainerRequest's container_count_max is an interesting issue, but however we do it I don't think it goes in the Container and I don't think it should block this story.

We probably do need to add something like "exit_code" to Container but let's implement what we have rather than wait for that bikeshed to be painted. Database migrations aren't that hard.

I don't think any of the other points in notes 8..11 need to be settled in order to implement this story as described.

#6518 is the dispatch story. Starting a page about how a dispatcher is supposed to work (on a Crunch2 dispatch wiki page), including the above questions, would be a good idea.

#13 Updated by Peter Amstutz about 4 years ago

Tom Clegg wrote:

Also missing: a "parent container" or "parent container request" concept to build process trees.

This is what requesting_container_uuid is for, if I get your drift. E.g., "cancel all active ContainerRequests that were initiated by this Container" in the issue description.

I think we should distinguish between infrastructure failure and task failure by distinguishing between "TempFail" and "PermFail" in the container state.

I disagree. Once a container stops, it stays stopped; there is nothing temporary about it. A client's decision about whether to start a new container B after container A fails is not a property of Container A.

"tempfail" may be a bad name. However, in principle I think it is important to try to distinguish between "this job could not complete for reasons outside of its control" and "this job failed due to user mistake" (if nothing else, so that the end user knows when a task failure is not their fault). The decision to schedule a container B may not be a "property" of A, but it is at least partly dependent on the outcome of A.

Deciding how much a given Container counts toward a ContainerRequest's container_count_max is an interesting issue, but however we do it I don't think it goes in the Container and I don't think it should block this story.

No it doesn't have to block this story but figuring out where retry logic is going to live (API server, dispatcher, workflow runner...) has major design implications.

We probably do need to add something like "exit_code" to Container but let's implement what we have rather than wait for that bikeshed to be painted. Database migrations aren't that hard.

I'm trying to pin down what "Failed" is intended to mean for a container. Perhaps you are suggesting that a job that runs to completion regardless of its exit code is "Complete" and that "Failed" be reserved for infrastructure errors?

I don't think any of the other points in notes 8..11 need to be settled in order to implement this story as described.

Ok. If the goal is to implement the minimum API server support to be able to start prototyping the dispatcher next sprint, then I'm on board with that.

#6518 is the dispatch story. Starting a page about how a dispatcher is supposed to work (on a Crunch2 dispatch wiki page), including the above questions, would be a good idea.

I'll start adding notes on that page.

#14 Updated by Brett Smith about 4 years ago

  • Target version changed from Arvados Future Sprints to 2015-12-16 sprint

#15 Updated by Tom Clegg about 4 years ago

Peter Amstutz wrote:

Ok. If the goal is to implement the minimum API server support to be able to start prototyping the dispatcher next sprint, then I'm on board with that.

Indeed. From the story description: "For this initial implementation, focus on implementing and testing a minimal feature set establishing the life cycle of Container and ContainerRequest objects, thus enabling us to develop and test other Crunch2 components."

The idea is that implementing this now as (imperfectly) specified will unblock several other stories; this means we'll have some code churn as we perfect the models, but unblocking is important.

#16 Updated by Tom Clegg about 4 years ago

  • Description updated (diff)

#17 Updated by Peter Amstutz almost 4 years ago

6429-crunch2-api ready for review.

#18 Updated by Tom Clegg almost 4 years ago

at 6fc8952

apps/workbench/app/models/container_requests.rb and containers.rb -- filenames should be singular

Container.resolve(req)
  • This looks like it would be more comfortable as a ContainerRequest instance method -- like c=req.resolve() instead of c=Container.resolve(req)?
  • Literal string "Committed" should be ContainerRequest.Committed

I'm not seeing why Container#process_tree_priority is needed. It looks like a "cancel a container by setting its priority to 0" feature but I don't think we want this. The "sometimes A=0 because B=0, and sometimes B=0 because A=0" aspect seems confusing; I think "user always sets request priority, system always sets container priority" would be easier to follow.

Migration
  • command, environment, mounts, runtime_constraints, and properties should be text instead of varchar(255) (containers.environment is right, but the rest aren't)
  • container_count_max should be NOT NULL

services/api/test/factories/*.rb look bogus, I'm guessing "rails g migration" made them and they should be deleted?

The where/each/save loop in Container#request_finalize should use ContainerRequest.update_all({state: ContainerRequest.Final}, ['container_uuid=?', uuid])

Error message "Cannot only update container_uuid to nil." should say "can", not "cannot".

Errors should not be capitalized, and should not repeat the field name (avoid "container_uuid Has not been resolved to a container" and "Finished at Illegal update of field finished_at" coming from x.full_messages)

ContainerRequest state changes Final→Committed and Committed→Uncommitted are currently possible, but should not be.

The way state transitions are checked is rather verbose and error-prone. How about starting off validate_change by doing something like "if !state_transitions[state_was].include?(state)", and then doing all the other state-dependent validations once we've got that out of the way?

state_transitions = {
  nil => [Uncommitted, Committed],
  Uncommitted => [Uncommitted, Committed],
  Committed => [Committed, Final],
  Final => [Final],
}
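A standalone sketch of how validate_change could consume that table (hypothetical; in the model this would add to errors via ActiveModel rather than return a boolean):

```ruby
# Hypothetical sketch of the suggested check, using the same table.
Uncommitted = "Uncommitted"
Committed = "Committed"
Final = "Final"

STATE_TRANSITIONS = {
  nil         => [Uncommitted, Committed],
  Uncommitted => [Uncommitted, Committed],
  Committed   => [Committed, Final],
  Final       => [Final],
}

# In validate_change this would be:
#   errors.add :state, "invalid change" unless valid_state_change?(state_was, state)
def valid_state_change?(state_was, state)
  STATE_TRANSITIONS.fetch(state_was, []).include?(state)
end
```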

#19 Updated by Peter Amstutz almost 4 years ago

Tom Clegg wrote:

at 6fc8952

apps/workbench/app/models/container_requests.rb and containers.rb -- filenames should be singular

Fixed.

Container.resolve(req)
  • This looks like it would be more comfortable as a ContainerRequest instance method -- like c=req.resolve() instead of c=Container.resolve(req)?

Done.

  • Literal string "Committed" should be ContainerRequest.Committed

Fixed.

I'm not seeing why Container#process_tree_priority is needed. It looks like a "cancel a container by setting its priority to 0" feature but I don't think we want this. The "sometimes A=0 because B=0, and sometimes B=0 because A=0" aspect seems confusing; I think "user always sets request priority, system always sets container priority" would be easier to follow.

Ambiguous use of the word "cancel" in the description:

"When changing Container state away from Running, cancel all active ContainerRequests that were initiated by this Container (see requesting_container_uuid)."

Since "priority == 0" means "please cancel this", I interpreted this to mean that the container request should have its priority adjusted.

But I think what we actually want is that when the container completes, both the container request that created the container, and any container requests created by the container are finalized.

Migration
  • command, environment, mounts, runtime_constraints, and properties should be text instead of varchar(255) (containers.environment is right, but the rest aren't)
  • container_count_max should be NOT NULL

Fixed.

services/api/test/factories/*.rb look bogus, I'm guessing "rails g migration" made them and they should be deleted?

Correct. Removed.

The where/each/save loop in Container#request_finalize should use ContainerRequest.update_all({state: ContainerRequest.Final}, ['container_uuid=?', uuid])

I disagree with this, because update_all skips validation and callbacks and as a result makes it harder to reason about the behavior (by having to consider code paths that don't run validations or callbacks...)

Error message "Cannot only update container_uuid to nil." should say "can", not "cannot".

Fixed.

Errors should not be capitalized, and should not repeat the field name (avoid "container_uuid Has not been resolved to a container" and "Finished at Illegal update of field finished_at" coming from x.full_messages)

Fixed.

ContainerRequest state changes Final→Committed and Committed→Uncommitted are currently possible, but should not be.

The way state transitions are checked is rather verbose and error-prone. How about starting off validate_change by doing something like "if !state_transitions[state_was].include?(state)", and then doing all the other state-dependent validations once we've got that out of the way?

You're right, that's a nicer way to implement it. Done.

#20 Updated by Tom Clegg almost 4 years ago

Peter Amstutz wrote:

Ambigious use of the word "cancel" in the description:

"When changing Container state away from Running, cancel all active ContainerRequests that were initiated by this Container (see requesting_container_uuid)."

Since "priority == 0" means "please cancel this" I interpreted this to mean that the container request should have it priority adjusted.

But I think what we actually want is that when the container completes, both the container request that created the container, and any container requests created by the container are finalized.

No, sorry, I think I was confused here: I misread the process_tree_priority method and thought it was cancelling the CRs that requested the finished container. It's cancelling the CRs that were requested by the finishing job, which is right. If the requester abandons them by exiting without waiting for the CRs to be satisfied, there is no point in satisfying them, or leaving them running (unless of course some other CRs are using them too). So we just have to set req.priority=0 and the CR hooks will take care of updating priority on the relevant containers, etc.

However, it seems to twig on priority instead of state. I think we want something like this, right?
    if state_was == Running and state_changed?
      act_as_system_user do
        ContainerRequest.where('requesting_container_uuid=? and priority>0 and state in (?)', uuid, [Queued, Running]).each do |cr|
          cr.priority = 0
          cr.save!
        end
      end
    end

We don't want to touch the priority of the CRs that requested the finished container. We do want to finalize them -- at least if we're not going to reattempt, which we never do yet -- but that's already handled in request_finalize.

The where/each/save loop in Container#request_finalize should use ContainerRequest.update_all({state: ContainerRequest.Final}, ['container_uuid=?', uuid])

I disagree with this, because update_all skips validation and callbacks and as a result makes it harder to reason about the behavior (by having to consider code paths that don't run validations or callbacks...)

Fair enough. In fact, I suppose we'll crash here if there are any Uncommitted CRs referencing this container. We should select for state=Committed -- and I suppose CRs with state=Uncommitted should get their container_uuid reset to nil instead? This seems like a worthwhile test case.

Can you do a bit of copy editing in the tests?
  • Some of the test names like "Container container complete" are a bit mysterious
  • Using c for a ContainerRequest, and t for a Container, seems a bit obfuscated (how about cr and c?)
Containers API says priority should be null (not zero) if state != Committed:

    cr = Container.find_by_uuid c.container_uuid
    assert_equal 5, cr.priority

    c.state = "Final"
    c.save!

    cr.reload
    assert_equal 0, cr.priority

#21 Updated by Peter Amstutz almost 4 years ago

Tom Clegg wrote:

No, sorry, I think I was confused here: I misread the process_tree_priority method and thought it was cancelling the CRs that requested the finished container. It's cancelling the CRs that were requested by the finishing job, which is right. If the requester abandons them by exiting without waiting for the CRs to be satisfied, there is no point in satisfying them, or leaving them running (unless of course some other CRs are using them too). So we just have to set req.priority=0 and the CR hooks will take care of updating priority on the relevant containers, etc.

Ok, the hook is now "handle_completed"

When the state changes to Cancelled or Complete:

  • CR's that requested this container are moved to Final
  • Committed CR's requested by this container have their priority set to 0

This will have the effect that child container requests will be cancelled from the top down; a child process won't be cancelled until its parent process has terminated.

We could also propagate priority changes directly: setting the priority of the parent container (to anything, not just 0) could automatically propagate downward to any child container requests; this would allow raising or lowering priority on the whole process tree.
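Such downward propagation could be sketched recursively (a hypothetical in-memory illustration; the real tree would be walked via container_uuid / requesting_container_uuid queries):

```ruby
# Hypothetical sketch: each node is {priority:, children: [...]}.
# Setting a priority on a node pushes the same value to every
# descendant, so raising, lowering, or zeroing the root affects
# the whole process tree.
def propagate_priority!(node, priority)
  node[:priority] = priority
  node[:children].each { |child| propagate_priority!(child, priority) }
  node
end
```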

Fair enough. In fact, I suppose we'll crash here if there are any Uncommitted CRs referencing this container. We should select for state=Committed -- and I suppose CRs with state=Uncommitted should get their container_uuid reset to nil instead? This seems like a worthwhile test case.

It already only selects committed container requests.

Can you do a bit of copy editing in the tests?
  • Some of the test names like "Container container complete" are a bit mysterious
  • Using c for a ContainerRequest, and t for a Container, seems a bit obfuscated (how about cr and c?)

Done.

Containers API says priority should be null (not zero) if state != Committed:
  • [...]

Done, added a nil validation check and tests.

#22 Updated by Tom Clegg almost 4 years ago

A test for the "cancel remaining child jobs via requesting_container_uuid when parent ends" behavior would still be good (unless it's here and I'm just not seeing it?). The rest LGTM, thanks.

#23 Updated by Peter Amstutz almost 4 years ago

Tom Clegg wrote:

A test for the "cancel remaining child jobs via requesting_container_uuid when parent ends" behavior would still be good (unless it's here and I'm just not seeing it?). The rest LGTM, thanks.

I had just such a test, and it got lost in the confusion about process trees from note-18. Restored the test.

#24 Updated by Peter Amstutz almost 4 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:a1adf1ed6f93ce0769f307a86b6389e9e8e630a9.
