Feature #14706

[Crunch2] Retain references + permissions to earlier containers when retrying a container request

Added by Peter Amstutz 5 months ago. Updated 3 months ago.

Status: New
Priority: Normal
Assigned To: -
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Story points: -

Description

The container request record lists only the most recent container attempted to fulfill the request. As a result, when a cancelled container is retried, the earlier cancelled containers are no longer visible to the user: their UUIDs are no longer mentioned in the container request record, so even if the client remembers a UUID, the user no longer has permission to retrieve that container record.

(See #14870 for the related problem that the logs from previous attempts are not preserved in the container request's log collection.)

Proposal:

We need a column that holds the UUIDs of all containers attempted for the request. This could be an array column (e.g., https://www.postgresql.org/docs/9.6/arrays.html) or a JSONB column.

The current data model has "container_uuid" as a singular value. Changing it to an array would break backwards compatibility, so the API should report past attempts in a separate field, like "past_container_uuids".

It is unclear whether the underlying database should have a single array column (where the first/last item is always the most recent attempt), or retain the container_uuid column and add a past_container_uuids column.

We need to be able to join on the array column in order to grant read permission to container records. Section 8.15.5 of the PostgreSQL docs suggests this is something like:

container.uuid = ANY (container_request.past_container_uuids)
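To make the intended semantics concrete, here is a minimal in-memory sketch in Python. It is not Arvados code. It assumes the second option above (keep container_uuid, add past_container_uuids). The class, the retry_with method, and the can_read_container helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ContainerRequest:
    """Hypothetical stand-in for a container request row."""
    uuid: str
    # Existing singular field: the most recent container attempted.
    container_uuid: Optional[str] = None
    # Proposed field: containers from earlier attempts, oldest first.
    past_container_uuids: List[str] = field(default_factory=list)

    def retry_with(self, new_container_uuid: str) -> None:
        """Assign a new container and remember the one it replaces."""
        if self.container_uuid is not None:
            self.past_container_uuids.append(self.container_uuid)
        self.container_uuid = new_container_uuid


def can_read_container(container_uuid: str,
                       readable_requests: List[ContainerRequest]) -> bool:
    """In-memory analogue of `container.uuid = ANY (past_container_uuids)`,
    extended to also match the current container_uuid."""
    return any(
        container_uuid == cr.container_uuid
        or container_uuid in cr.past_container_uuids
        for cr in readable_requests
    )
```

With this model, a container that was replaced by a retry stays readable through any container request the user can read, and container_uuid keeps its current single-value meaning for existing clients.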


Related issues

Related to Arvados - Feature #8018: [Crunch2] Identify container failure and retry (Resolved, 09/23/2016)

Related to Arvados - Story #14870: [API] Access logs from previous attempts after auto-retrying a container request (Resolved, 03/01/2019)

History

#1 Updated by Peter Amstutz 5 months ago

  • Status changed from New to In Progress

#2 Updated by Peter Amstutz 5 months ago

  • Description updated (diff)
  • Status changed from In Progress to New

#3 Updated by Peter Amstutz 5 months ago

  • Tracker changed from Bug to Feature

#4 Updated by Peter Amstutz 3 months ago

  • Description updated (diff)

#5 Updated by Tom Clegg 3 months ago

  • Related to Feature #8018: [Crunch2] Identify container failure and retry added

#6 Updated by Tom Clegg 3 months ago

I'm not sure adding an array of container UUIDs to the container_requests table would solve this problem. Often the most valuable troubleshooting information is in the log files, which would still be inaccessible.

It might be more useful to focus on preserving all relevant logs in the container request's log collection, even if they span multiple containers. Perhaps the API server could merge the logs: e.g., instead of replacing the CR's entire log collection when the container's log is updated, just copy the container's log files into a "container ${uuid}" subdir in the container request's log collection. This would disturb existing scripts/users who expect the log files to be at the top level, but it would be compatible with multiple concurrent containers (e.g., speculative retry without killing, and replication>1 service containers).

This also helps in the case where the container record itself is really what's wanted, since that is included in the container's log collection. (There are currently some exceptions -- e.g., a log collection isn't created at all when a container doesn't fit any instance type -- but those could be fixed.)

It's also worth addressing the permission issue, at least for admins (currently even the dispatcher isn't allowed to see that a container has state=Cancelled if all matching CRs have had different containers assigned!). If we need to do it for users, we should consider the performance implications of an array vs. a separate table to express the many-to-many relationship.

#7 Updated by Tom Clegg 3 months ago

One more refinement: Put a copy of the latest container's logs in the root dir of the container request's log collection, in addition to a subdir named after the container UUID. This way, existing scripts continue to work on new logs.
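A rough sketch of the merge described in this comment and the previous one, in Python. It models a log collection as a plain dict from path to file content rather than using any real Arvados API. The function name is hypothetical, the subdir name follows the "container ${uuid}" pattern suggested above, and it assumes a container's own log files are flat (no "/" in their names).

```python
from typing import Dict

# Hypothetical model of a log collection: path -> file content.
LogFiles = Dict[str, bytes]


def merge_container_logs(cr_logs: LogFiles,
                         container_uuid: str,
                         container_logs: LogFiles) -> LogFiles:
    """Return the container request's log collection after merging in
    the logs of its latest container."""
    # Keep the per-container subdirs from earlier attempts, but drop the
    # old top-level files: the root should mirror only the latest container.
    merged = {path: data for path, data in cr_logs.items() if "/" in path}
    subdir = f"container {container_uuid}/"
    for path, data in container_logs.items():
        merged[subdir + path] = data  # preserved per-attempt copy
        merged[path] = data           # root copy for existing scripts
    return merged
```

Calling this once per attempt leaves the newest attempt's files at the top level, where existing scripts expect them, while each earlier attempt stays available under its own subdir.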

#8 Updated by Tom Clegg 3 months ago

  • Related to Story #14870: [API] Access logs from previous attempts after auto-retrying a container request added

#9 Updated by Tom Clegg 3 months ago

  • Subject changed from [Crunch2] Retain record of container retries to [Crunch2] Retain references + permissions to earlier containers when retrying a container request
  • Description updated (diff)
