Bug #9848 (closed)
[Crunch2] [API] Ensure saved logs and outputs from containers are accessible to non-admin users
Description
crunch-run saves container logs (and outputs) with owner_uuid = uuid of the dispatch user.
Currently, this means only admins are able to read them.
A user who submits a container request should be able to read the log (and output) of the corresponding container.
Updated by Tom Clegg about 8 years ago
Possible approach:
In ContainerRequest#container_completed! in source:services/api/app/models/container_request.rb, create new log and output collections with the same owner_uuid as the container request itself (copying manifest_text from the container log/output).
In crunch-run, set expires_at=(now + permissionTTL) when saving log and output collections (otherwise, all logs and outputs will use up storage space forever, even after users delete their copies).
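A minimal sketch of that first step, assuming a Rails-style ContainerRequest model; the portable_data_hash lookup, the collection name format, and the Final state transition here are illustrative assumptions, not the actual implementation:

class ContainerRequest < ArvadosModel
  def container_completed!
    c = Container.find_by_uuid(container_uuid)
    ['log', 'output'].each do |out_type|
      pdh = c.attributes[out_type]
      next if pdh.nil?
      # Assumed lookup: the dispatcher-owned collection that crunch-run saved.
      src = Collection.unscoped.find_by(portable_data_hash: pdh)
      next if src.nil?
      # Copy the manifest into a new collection with the same owner as the
      # request, so it stays readable after the dispatcher's copy (saved
      # with expires_at by crunch-run) goes away.
      Collection.create!(owner_uuid: owner_uuid,
                         manifest_text: src.manifest_text,
                         name: "Container #{out_type} for request #{uuid}")
    end
    update_attributes!(state: Final)
  end
end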
Updated by Tom Morris about 8 years ago
- Story points set to 1.0
- Target version set to Arvados Future Sprints
Updated by Tom Clegg about 8 years ago
In the container re-use code, we'll need to check whether the log is still available (as we currently do with outputs in crunch1)
Updated by Tom Clegg about 8 years ago
- Subject changed from [API] Ensure saved logs and outputs from containers are accessible to non-admin users to [Crunch2] [API] Ensure saved logs and outputs from containers are accessible to non-admin users
Updated by Tom Clegg about 8 years ago
- Target version changed from Arvados Future Sprints to 2016-09-28 sprint
Updated by Peter Amstutz about 8 years ago
- The log checking code:
readable_pdh = Collection.
  readable_by(current_user).
  select('portable_data_hash')
completed = completed.
  where("log in (#{readable_pdh.to_sql})").
  order('finished_at asc').
  limit(1)
This kind of hurts my head. I get this is using a subquery, so we're assuming the postgres query planner will DTRT. However, if it does not optimize the query, it could inefficiently enumerate every collection visible to the user. Can we verify that it is going to be smart and execute the query in a way that ensures that it uses the portable_data_hash index?
- Right above the new code I noticed this line (from Lucas):
completed = candidates.where(state: Complete).where(exit_code: 0)
I don't think we want to filter on exit code: wouldn't a deterministic container always return the same exit code anyway? Also, a few tools like "grep" return non-zero codes for conditions that are not errors.
- It appears that when job reuse happens, the container_request state remains in Committed (which means it won't create the output/log collections). I think there needs to be an additional hook for when the container request is in Committed state which checks to see if the referenced container is in Complete/Cancelled state and calls container_completed!
Updated by Tom Clegg about 8 years ago
Peter Amstutz wrote:
- The log checking code:
[...]
This kind of hurts my head. I get this is using a subquery, so we're assuming the postgres query planner will DTRT. However, if it does not optimize the query, it could inefficiently enumerate every collection visible to the user. Can we verify that it is going to be smart and execute the query in a way that ensures that it uses the portable_data_hash index?
"Explain analyze" on a small database doesn't necessarily tell us very much about real-world performance (postgres chooses seq scan over an index when it knows the table is small) but here it is:
arvados_production=# explain analyze SELECT "containers".* FROM "containers"
  WHERE (log in (SELECT portable_data_hash FROM "collections"
                 WHERE (expires_at IS NULL or expires_at > statement_timestamp())))
  ORDER BY finished_at asc LIMIT 1;
                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1136.43..1136.43 rows=1 width=2299) (actual time=0.031..0.031 rows=0 loops=1)
   ->  Sort  (cost=1136.43..1136.43 rows=1 width=2299) (actual time=0.031..0.031 rows=0 loops=1)
         Sort Key: containers.finished_at
         Sort Method: quicksort  Memory: 25kB
         ->  Hash Join  (cost=1072.28..1136.42 rows=1 width=2299) (actual time=0.014..0.014 rows=0 loops=1)
               Hash Cond: ((collections.portable_data_hash)::text = (containers.log)::text)
               ->  HashAggregate  (cost=1071.16..1117.80 rows=4664 width=36) (never executed)
                     Group Key: (collections.portable_data_hash)::text
                     ->  Seq Scan on collections  (cost=0.00..1047.53 rows=9455 width=36) (never executed)
                           Filter: ((expires_at IS NULL) OR (expires_at > statement_timestamp()))
               ->  Hash  (cost=1.05..1.05 rows=5 width=2299) (actual time=0.005..0.005 rows=0 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 0kB
                     ->  Seq Scan on containers  (cost=0.00..1.05 rows=5 width=2299) (actual time=0.003..0.005 rows=5 loops=1)
 Planning time: 1.122 ms
 Execution time: 0.149 ms
(15 rows)
The more obvious solution is to get batches of containers and do the readable_by query manually until one is readable, but that seems messier and just as likely to be slower. I think by default we should let Postgres do this sort of thing for us, and only poke it in the eye on an as-deserved basis.
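For comparison, the batch-and-check alternative might look roughly like this (illustrative only, inside whatever method performs the reuse lookup):

# Page through completed candidates and test each log's readability with a
# separate collections query. Note the cost: one extra query per candidate,
# plus limit/offset bookkeeping, which is what makes this messier than the
# subquery version above.
offset = 0
loop do
  batch = completed.order('finished_at asc').limit(100).offset(offset).to_a
  break if batch.empty?
  batch.each do |c|
    if Collection.readable_by(current_user).
        where(portable_data_hash: c.log).exists?
      return c
    end
  end
  offset += 100
end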
- Right above the new code I noticed this line (from Lucas):
[...]
I don't think we want to filter on exit code: wouldn't a deterministic container always return the same exit code anyway? Also, a few tools like "grep" return non-zero codes for conditions that are not errors.
It's common for a container to exit non-zero after a fatal transient error, like OOM or a network problem that prevents it from reading its input. I don't think it would make sense to reuse those outputs, or to disqualify all successful containers with identical outputs just because one failed attempt produced different output.
- It appears that when job reuse happens, the container_request state remains in Committed (which means it won't create the output/log collections). I think there needs to be an additional hook for when the container request is in Committed state which checks to see if the referenced container is in Complete/Cancelled state and calls container_completed!
Yes, presumably committed requests should get finalized immediately if the container is already finished. I guess we didn't notice this in #9623? We should add a test. OK to merge this in the meantime?
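A sketch of the kind of hook meant here, on the ContainerRequest side; the callback name and the state constants are illustrative assumptions, not the actual fix:

class ContainerRequest < ArvadosModel
  after_update :finalize_if_container_finished

  protected

  # If this request has just been committed and assigned a container that is
  # already finished (the reuse case), finalize it immediately instead of
  # waiting for a container state change that will never come.
  def finalize_if_container_finished
    return unless state == Committed && container_uuid
    c = Container.find_by_uuid(container_uuid)
    container_completed! if c && [Complete, Cancelled].include?(c.state)
  end
end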
Updated by Tom Clegg about 8 years ago
On the topic of exit codes, surely the only sane approach is to follow the rule: exit 0 means success, exit non-zero means failure. Programs that are broken (e.g., a shell script that says "grep x y" where it should say "grep x y || true") are broken. If we start second-guessing exit codes, we'll just never know whether anything succeeded, even programs that aren't broken...
Updated by Radhika Chippada about 8 years ago
- Target version changed from 2016-09-28 sprint to 2016-10-12 sprint
Updated by Tom Clegg almost 8 years ago
9848-finalize-on-reuse
Updated by Tom Clegg almost 8 years ago
- Target version changed from 2016-10-12 sprint to 2016-10-26 sprint
Updated by Tom Clegg almost 8 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:06f40b3d477f3ab1c834d9671808c3da02ca7dfd.